Detecting and correcting systematic variation in large-scale RNA sequencing data

Cited by: 121
Authors
Li, Sheng [1,2]
Labaj, Pawel P. [3]
Zumbo, Paul [1,2]
Sykacek, Peter [3]
Shi, Wei [4]
Shi, Leming [5,6,7]
Phan, John [8]
Wu, Po-Yen [8]
Wang, May [8]
Wang, Charles [9,10]
Thierry-Mieg, Danielle [11]
Thierry-Mieg, Jean [11]
Kreil, David P. [3,12]
Mason, Christopher E. [1,2,13]
Affiliations
[1] Weill Cornell Med Coll, Dept Physiol & Biophys, New York, NY 10065 USA
[2] Weill Cornell Med Coll, HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsau, New York, NY USA
[3] Boku Univ Vienna, Bioinformat Res Grp, Vienna, Austria
[4] WEHI, Dept Bioinformat, Melbourne, Vic, Australia
[5] Fudan Univ, State Key Lab Genet Engn, Sch Life Sci, Shanghai 200433, Peoples R China
[6] Fudan Univ, MOE Key Lab Contemporary Anthropol, Sch Life Sci, Shanghai 200433, Peoples R China
[7] Fudan Univ, Sch Pharm, Shanghai 200433, Peoples R China
[8] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[9] Loma Linda Univ, Ctr Genom, Loma Linda, CA 92350 USA
[10] Loma Linda Univ, Sch Med, Div Microbiol & Mol Genet, Loma Linda, CA USA
[11] Natl Ctr Biotechnol Informat, Bethesda, MD USA
[12] Univ Warwick, Coventry CV4 7AL, W Midlands, England
[13] Feil Family Brain & Mind Res Inst, New York, NY USA
Funding
National Institutes of Health (USA);
Keywords
QUALITY-CONTROL; GENE-EXPRESSION; DIFFERENTIAL EXPRESSION; UNWANTED VARIATION; MESSENGER-RNA; SEQ; NORMALIZATION; TRANSCRIPTS; ALGORITHMS; PACKAGE;
DOI
10.1038/nbt.3000
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
Pages: 888-895
Page count: 8