Robust detection and identification of sparse segments in ultrahigh dimensional data analysis

Cited by: 19
Authors
Cai, T. Tony [1]
Jeng, X. Jessie [1]
Li, Hongzhe [1]
Affiliations
[1] Univ Penn, Sch Med, Dept Biostat & Epidemiol, Philadelphia, PA 19104 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
DNA copy number variant; Next generation sequencing data; Optimality; Robust segment detector; Robust segment identifier; COPY-NUMBER VARIATION; STRUCTURAL VARIATION; RESOLUTION; SEQ; ALGORITHM;
DOI
10.1111/j.1467-9868.2012.01028.x
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Copy number variants (CNVs) are alterations of the DNA of a genome that result in a cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method estimates the signal segments near optimally whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under various noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to illustrate the theory and the methods further.
Pages: 773-797
Number of pages: 25