
Title
Optimal Sparse Segment Identification: Theory and Applications in CNV Analysis
Speaker
Jessie Jeng, University of Pennsylvania
Abstract
Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of identifying sparse short segments in a long sequence of noisy observations, where the number, length and location of the segments are unknown. We present a statistical characterization of the identifiable region of a segment, within which it is possible to reliably separate the segment from noise. An efficient likelihood ratio selection (LRS) procedure is developed and is shown to be asymptotically optimal in the sense that LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable region. The problem is further studied in the setting where a set of aligned sequences of observations is available. Signal segments are classified into rare and common groups according to their carrier proportions across sequences. A proportion adaptive segmentation (PAS) procedure is proposed, and its asymptotic optimality is presented for detecting both rare and common segments. Both LRS and PAS are demonstrated via simulations and CNV analysis on high-density SNP data. The results show that the proposed methods can yield greater power for detecting the true segments than some standard signal identification procedures.