Trimming of sequence reads alters RNA-Seq gene expression estimates

Cited by: 114
Authors
Williams, Claire R. [1]
Baccarella, Alyssa [2]
Parrish, Jay Z. [1]
Kim, Charles C. [2, 3]
Affiliations
[1] Univ Washington, Dept Biol, Seattle, WA 98195 USA
[2] Univ Calif San Francisco, Div Expt Med, Dept Med, San Francisco, CA 94110 USA
[3] Verily, Mountain View, CA 94043 USA
Funding
National Institutes of Health (USA);
Keywords
RNA-Seq; Trimming; Gene expression; Drosophila; MESSENGER-RNA; QUALITY ASSESSMENT; DIFFERENTIAL GENE; TRANSCRIPTOME; AMPLIFICATION; COMPLEXITY; ALIGNMENT; NEURONS; TOPHAT; BIASES;
DOI
10.1186/s12859-016-0956-2
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias.
Results: To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms (SolexaQA, Trimmomatic, and ConDeTri) to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-Seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates.
Conclusions: We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and that trimming, if used, should be combined with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.
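The two steps the abstract describes, quality-based trimming followed by a minimum read length filter, can be illustrated with a minimal Python sketch. This is not the algorithm used by SolexaQA, Trimmomatic, or ConDeTri; it is a simplified 3'-end trimmer written for illustration. The function names, the Phred+33 quality encoding, and the thresholds (`min_q=20`, `min_len=20`) are assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(seq: str, quals: List[int], min_q: int) -> str:
    """Drop bases from the 3' end until a base with quality >= min_q is reached.

    Simplified illustration only; real trimmers use more elaborate rules
    (for example sliding windows or longest high-quality segments).
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


def trim_and_filter(
    reads: List[Tuple[str, str]], min_q: int = 20, min_len: int = 20
) -> List[str]:
    """Trim each (sequence, quality_string) read, then discard reads shorter than min_len."""
    kept = []
    for seq, quality_string in reads:
        trimmed = trim_3prime(seq, decode_phred33(quality_string), min_q)
        # Minimum length filter: very short reads are prone to spurious mapping.
        if len(trimmed) >= min_len:
            kept.append(trimmed)
    return kept


if __name__ == "__main__":
    # Hypothetical reads: 'I' encodes Q40, '#' encodes Q2 under Phred+33.
    example_reads = [
        ("ACGTACGTAC", "IIIIIIII##"),
        ("ACGTACGTAC", "II########"),
    ]
    print(trim_and_filter(example_reads, min_q=20, min_len=5))
```

In this toy input the second read is trimmed to two bases and then discarded by the length filter, which is the kind of short read the abstract identifies as a likely source of spurious alignments.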
Pages: 13