DAFS: a data-adaptive flag method for RNA-sequencing data to differentiate genes with low and high expression

Cited by: 20
Authors
George, Nysia I. [1]
Chang, Ching-Wei [1]
Affiliations
[1] US FDA, Natl Ctr Toxicol Res, Div Bioinformat & Biostat, Jefferson, AR 72079 USA
Source
BMC BIOINFORMATICS | 2014 / Vol. 15
Keywords
RNA-sequencing; Low expression; Data-adaptive; Flag; Mixture distribution; MESSENGER-RNA; ABUNDANCE; MODEL
DOI
10.1186/1471-2105-15-92
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in genetic and genomic variation analysis. Because of its large dynamic range of expression levels, RNA-seq is more likely to detect transcripts with low expression. It is clear that genes with no mapped reads are not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression. To date, there is no consensus on the definition of low expression. Since random variation is high in regions with low expression and distributions of transcript expression are affected by numerous experimental factors, methods to differentiate low- and high-expressed data in a sample are critical to interpreting classes of abundance levels in RNA-seq data.
Results: A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
Conclusions: The analysis of real and simulated data presented here illustrates the need to employ data-adaptive methodology in lieu of arbitrary cutoffs to distinguish low-expressed RNA-seq data from high-expressed data. Our results also show the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to ascertain stably expressed RNA-seq data is essential in the filtering process of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide-ranging application in the continuing development of RNA-seq analysis.
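The cutoff search described under Results can be illustrated with a heavily simplified sketch. It is not the authors' DAFS algorithm: it omits the multivariate adaptive regression splines step entirely, and it assumes (hypothetically) that log-scale expression above a good cutoff looks roughly normal, so the cutoff is chosen to minimize the Kolmogorov-Smirnov distance between the retained values and a normal distribution fitted to them. All function names, the candidate grid, the `min_count` guard, and the simulated data are illustrative assumptions.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_distance_to_normal(values):
    """KS distance between the empirical CDF of `values` and a normal
    distribution whose mean and standard deviation are estimated from them."""
    xs = sorted(values)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # Compare against the empirical CDF just before and just after x.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def scan_cutoff(log_expr, candidates, min_count=200):
    """Return (cutoff, distance) for the candidate whose retained values
    (those strictly above the cutoff) are closest to normal in KS distance.
    Candidates retaining fewer than `min_count` values are skipped
    (an illustrative guard, not part of the published method)."""
    best_cutoff, best_d = None, float("inf")
    for c in candidates:
        upper = [v for v in log_expr if v > c]
        if len(upper) < min_count:
            continue
        d = ks_distance_to_normal(upper)
        if d < best_d:
            best_cutoff, best_d = c, d
    return best_cutoff, best_d


# Simulated log-scale expression: a low-expression group and a high-expression group.
random.seed(0)
low = [random.gauss(0.0, 1.0) for _ in range(600)]
high = [random.gauss(8.0, 1.5) for _ in range(1400)]
data = low + high

grid = [i * 0.25 for i in range(-8, 41)]  # candidate cutoffs from -2.0 to 10.0
cutoff, distance = scan_cutoff(data, grid)
print(f"estimated cutoff: {cutoff:.2f} (KS distance {distance:.3f})")
```

On simulated data with two well-separated groups like this, the selected cutoff should fall between the two modes. When the groups overlap heavily, a scan of this kind becomes less stable, which is the situation the abstract flags as problematic for the two-component mixture approach as well.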
Pages: 11