Bias detection and correction in RNA-Sequencing data

Cited by: 111
Authors
Zheng, Wei [1]
Chung, Lisa M. [2]
Zhao, Hongyu [1,2]
Affiliations
[1] Yale Univ, Keck Lab, New Haven, CT 06510 USA
[2] Yale Univ, Sch Publ Hlth, Div Biostat, New Haven, CT 06510 USA
Funding
National Institutes of Health (USA);
Keywords
GENE-EXPRESSION; DIFFERENTIAL EXPRESSION; SEQ; NORMALIZATION; TRANSCRIPTOME; RESOLUTION;
DOI
10.1186/1471-2105-12-290
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and ability to identify novel transcripts. Moreover, for genes with multiple isoforms, expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work revealed that base level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear though how the base level read count bias may affect gene level expression estimates.
Results: In this paper, by using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased in terms of gene length, GC content and dinucleotide frequencies. We directly examined the biases at the gene-level, and proposed a simple generalized-additive-model based approach to correct different sources of biases simultaneously. Compared to previously proposed base level correction methods, our method reduces bias in gene-level expression estimates more effectively.
Conclusions: Our method identifies and corrects different sources of biases in gene-level expression measures from RNA-Seq data, and provides more accurate estimates of gene expression levels from RNA-Seq. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.
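The RPKM measure named in the abstract can be illustrated with a minimal sketch. It shows only the standard RPKM definition (reads per kilobase of gene length per million mapped reads), not the authors' bias-correction method; the function name, gene names, and counts are hypothetical and do not come from the paper.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: two genes with the same read count but different lengths.
genes = {"geneA": (500, 2000), "geneB": (500, 4000)}  # (read count, length in bp)
total_reads = 10_000_000  # assumed library size

values = {name: rpkm(count, length, total_reads)
          for name, (count, length) in genes.items()}
print(values)
```

With equal read counts, the gene twice as long receives half the RPKM value. This length normalization is the gene-level estimate that the abstract reports as still biased with respect to gene length, GC content and dinucleotide frequencies.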
Pages: 14
Related papers
34 in total
[1]   COMPLEMENTARY-DNA SEQUENCING - EXPRESSED SEQUENCE TAGS AND HUMAN GENOME PROJECT [J].
ADAMS, MD ;
KELLEY, JM ;
GOCAYNE, JD ;
DUBNICK, M ;
POLYMEROPOULOS, MH ;
XIAO, H ;
MERRIL, CR ;
WU, A ;
OLDE, B ;
MORENO, RF ;
KERLAVAGE, AR ;
MCCOMBIE, WR ;
VENTER, JC .
SCIENCE, 1991, 252 (5013) :1651-1656
[2]   Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments [J].
Bullard, James H. ;
Purdom, Elizabeth ;
Hansen, Kasper D. ;
Dudoit, Sandrine .
BMC BIOINFORMATICS, 2010, 11
[3]   Evaluation of DNA microarray results with quantitative gene expression platforms [J].
Canales, Roger D. ;
Luo, Yuling ;
Willey, James C. ;
Austermiller, Bradley ;
Barbacioru, Catalin C. ;
Boysen, Cecilie ;
Hunkapiller, Kathryn ;
Jensen, Roderick V. ;
Knight, Charles R. ;
Lee, Kathleen Y. ;
Ma, Yunqing ;
Maqsodi, Botoul ;
Papallo, Adam ;
Peters, Elizabeth Herness ;
Poulter, Karen ;
Ruppel, Patricia L. ;
Samaha, Raymond R. ;
Shi, Leming ;
Yang, Wen ;
Zhang, Lu ;
Goodsaid, Federico M. .
NATURE BIOTECHNOLOGY, 2006, 24 (09) :1115-1122
[4]   Substantial biases in ultra-short read data sets from high-throughput DNA sequencing [J].
Dohm, Juliane C. ;
Lottaz, Claudio ;
Borodina, Tatiana ;
Himmelbauer, Heinz .
NUCLEIC ACIDS RESEARCH, 2008, 36 (16)
[5]   Statistical issues in the analysis of Illumina data [J].
Dunning, Mark J. ;
Barbosa-Morais, Nuno L. ;
Lynch, Andy G. ;
Tavare, Simon ;
Ritchie, Matthew E. .
BMC BIOINFORMATICS, 2008, 9 (1)
[6]   A code for transcription initiation in mammalian genomes [J].
Frith, Martin C. ;
Valen, Eivind ;
Krogh, Anders ;
Hayashizaki, Yoshihide ;
Carninci, Piero ;
Sandelin, Albin .
GENOME RESEARCH, 2008, 18 (01) :1-12
[7]  
GABRIEL, KR .
BIOMETRIKA, 1971, 58 :453, DOI 10.2307/2334381
[8]   Length bias correction for RNA-seq data in gene set analyses [J].
Gao, Liyan ;
Fang, Zhide ;
Zhang, Kui ;
Zhi, Degui ;
Cui, Xiangqin .
BIOINFORMATICS, 2011, 27 (05) :662-669
[9]   Biases in Illumina transcriptome sequencing caused by random hexamer priming [J].
Hansen, Kasper D. ;
Brenner, Steven E. ;
Dudoit, Sandrine .
NUCLEIC ACIDS RESEARCH, 2010, 38 (12) :e131
[10]   CpG doublets, CpG islands and Alu repeats in long human DNA sequences from different isochore families [J].
Jabbari, K ;
Bernardi, G .
GENE, 1998, 224 (1-2) :123-128