Robust detection and identification of sparse segments in ultrahigh dimensional data analysis

Cited by: 19
Authors
Cai, T. Tony [1 ]
Jeng, X. Jessie [1 ]
Li, Hongzhe [1 ]
Affiliations
[1] Univ Penn, Sch Med, Dept Biostat & Epidemiol, Philadelphia, PA 19104 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
DNA copy number variant; Next generation sequencing data; Optimality; Robust segment detector; Robust segment identifier; COPY-NUMBER VARIATION; STRUCTURAL VARIATION; RESOLUTION; SEQ; ALGORITHM;
DOI
10.1111/j.1467-9868.2012.01028.x
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract
Copy number variants (CNVs) are alterations of the DNA of a genome that result in the cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method estimates the signal segments near-optimally whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under various noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
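To make the problem statement concrete, here is a minimal Python sketch of one way to look for short mean-shifted segments in a long sequence with heavy-tailed noise. It is an illustration only and should not be read as the authors' procedure: the abstract does not spell out the algorithm, so the block-median step, the MAD-based scaling, the scan statistic, the default threshold, and every name (`bin_medians`, `robust_segments`, `m`, `max_len`, `threshold`) are assumptions made for this example.

```python
import math
import statistics


def bin_medians(x, m):
    """Median of each consecutive block of m observations.

    A trailing partial block is dropped. Medians are far less sensitive
    to heavy-tailed noise than block means.
    """
    return [statistics.median(x[i:i + m]) for i in range(0, len(x) - m + 1, m)]


def robust_segments(x, m=10, max_len=20, threshold=None):
    """Return candidate segments as (start, end) index pairs, end exclusive.

    Hypothetical recipe (an assumption, not taken from the paper):
      1. reduce the data to block medians,
      2. standardize them with the median and the scaled MAD,
      3. scan all intervals of up to max_len blocks with |sum| / sqrt(length),
      4. greedily keep the strongest non-overlapping intervals.
    Segments are returned in order of decreasing scan statistic.
    """
    y = bin_medians(x, m)
    if not y:
        return []

    center = statistics.median(y)
    mad = statistics.median([abs(v - center) for v in y])
    scale = 1.4826 * mad if mad > 0 else 1.0
    z = [(v - center) / scale for v in y]
    n = len(z)

    if threshold is None:
        # Assumed default: a Bonferroni-style level over all scanned intervals.
        threshold = math.sqrt(2.0 * math.log(n * max_len))

    # Prefix sums give every interval sum in constant time.
    prefix = [0.0]
    for v in z:
        prefix.append(prefix[-1] + v)

    candidates = []
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            total = prefix[start + length] - prefix[start]
            stat = abs(total) / math.sqrt(length)
            if stat > threshold:
                candidates.append((stat, start, start + length))

    # Strongest first; keep an interval only if it overlaps none already kept.
    candidates.sort(reverse=True)
    chosen = []
    for _, s, e in candidates:
        if all(e <= cs or s >= ce for cs, ce in chosen):
            chosen.append((s, e))

    # Map block indices back to positions in the original sequence.
    return [(s * m, e * m) for s, e in chosen]
```

Calling `robust_segments(x)` on a sequence with Cauchy-like noise and one elevated stretch should place that stretch first in the returned list. The block size `m` trades resolution for robustness: larger blocks tame heavier tails but can miss segments shorter than a block.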
Pages: 773-797
Page count: 25