Comparing Segmentation Methods for Genome Annotation Based on RNA-Seq Data

Cited by: 9
Authors
Cleynen, Alice [1, 2, 3, 4]
Dudoit, Sandrine [3, 4]
Robin, Stephane [1, 2]
Affiliations
[1] AgroParisTech, UMR 518, F-75231 Paris 05, France
[2] INRA, UMR 518, F-75231 Paris 05, France
[3] Univ Calif Berkeley, Div Biostat, Berkeley, CA 94720 USA
[4] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Change-point detection; Confidence intervals; Count data; Genome annotation; Negative binomial distribution; RNA-Seq; Segmentation; GC-content normalization; Array CGH data
DOI
10.1007/s13253-013-0159-5
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Transcriptome sequencing (RNA-Seq) yields massive data sets, containing a wealth of information on the expression of a genome. While numerous methods have been developed for the analysis of differential gene expression, little has been attempted for the localization of transcribed regions, that is, segments of DNA that are transcribed and processed to result in a mature messenger RNA. Our understanding of genomes, mostly annotated from biological experiments or computational gene prediction methods, could benefit greatly from re-annotation using the high precision of RNA-Seq. We consider five classes of genome segmentation methods to delineate transcribed regions, including intron/exon boundaries, based on RNA-Seq data. The methods provide different functionality and include both exact and heuristic approaches, using diverse models, such as hidden Markov or Bayesian models, and diverse algorithms, such as dynamic programming or the forward-backward algorithm. We evaluate the methods in a simulation study where RNA-Seq read counts are generated from parametric models as well as by resampling of actual yeast RNA-Seq data. The methods are compared in terms of criteria that include global and local fit to a reference segmentation, Receiver Operating Characteristic (ROC) curves, and coverage of credibility intervals based on posterior change-point distributions. All compared algorithms are implemented in packages available on the Comprehensive R Archive Network (CRAN, http://cran.r-project.org). The data set used in the simulation study is publicly available from the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra). While the different methods each have pros and cons, our results suggest that the EBS Bayesian approach of Rigaill, Lebarbier, and Robin (2012) performs well in a re-annotation context, as illustrated in the simulation study and in the application to actual yeast RNA-Seq data. This article has supplementary material online.
Pages: 101-118
Number of pages: 18