BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data

Cited by: 10
Authors
Gu, Jinghua [1]
Wang, Xiao [1]
Hilakivi-Clarke, Leena [2]
Clarke, Robert [2]
Xuan, Jianhua [1]
Affiliations
[1] Virginia Polytech Inst & State Univ, Dept Elect & Comp Engn, Blacksburg, VA 24061 USA
[2] Georgetown Univ, Dept Oncol, Lombardi Comprehens Canc Ctr, Washington, DC USA
Source
BMC BIOINFORMATICS | 2014 / Vol. 15
Funding
US National Institutes of Health (NIH);
Keywords
EXPRESSION ANALYSIS; GENE-EXPRESSION; TRANSCRIPTOMES; INFERENCE;
DOI
10.1186/1471-2105-15-S9-S6
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated by biological and technical variation in RNA-Seq data. Results: We systematically study the variation in count data and dissect its sources into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimation of gene-level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, in which sequencing counts are synthesized from parameters learned from real datasets, demonstrate the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, a performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls shows that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application to a breast cancer dataset further illustrates that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases. Conclusions: We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated the desirable performance of the proposed method.
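The two layers of variation named in the abstract can be illustrated with a small simulation. The sketch below is not the BADGE model itself: the abstract does not give its parameterization, so the function names, the single Gamma layer for between-sample abundance, the mean-one lognormal bias factor, and all numeric settings are assumptions chosen only to show how such layers make read counts over-dispersed relative to a plain Poisson model.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the moderate rates used here; exp(-rate) underflows for
    very large rates, so this is not a general-purpose sampler.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_gene_counts(mean_abundance, n_samples, n_bins, shape, sigma, rng):
    """Simulate an n_samples x n_bins count matrix for one hypothetical gene.

    Between-sample layer (assumed): each sample's abundance is Gamma
    distributed with the given shape and mean equal to mean_abundance.
    Within-sample layer (assumed): each bin's Poisson rate is the sample
    abundance times a lognormal factor with mean one.
    """
    counts = []
    for _ in range(n_samples):
        abundance = rng.gammavariate(shape, mean_abundance / shape)
        row = []
        for _ in range(n_bins):
            # mu = -sigma^2 / 2 makes the lognormal factor have mean 1.
            bias = rng.lognormvariate(-0.5 * sigma ** 2, sigma)
            row.append(sample_poisson(abundance * bias, rng))
        counts.append(row)
    return counts


rng = random.Random(0)
counts = simulate_gene_counts(20.0, 200, 10, 5.0, 0.5, rng)
flat = [c for row in counts for c in row]
# For a pure Poisson model the variance would be close to the mean.
print("mean:", statistics.mean(flat), "variance:", statistics.variance(flat))
```

Under these assumed settings the sample variance comes out well above the sample mean, which is the over-dispersion that a hierarchical model of this kind is meant to absorb.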
Pages: 11