Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data

Cited by: 32
Authors
Paulson, Joseph N. [1 ,2 ,6 ]
Chen, Cho-Yi [1 ,2 ]
Lopes-Ramos, Camila M. [1 ,2 ]
Kuijjer, Marieke L. [1 ,2 ]
Platig, John [1 ,2 ]
Sonawane, Abhijeet R. [3 ,4 ]
Fagny, Maud [1 ,2 ]
Glass, Kimberly [1 ,2 ,3 ,4 ]
Quackenbush, John [1 ,2 ,3 ,4 ,5 ]
Affiliations
[1] Dana Farber Canc Inst, Dept Biostat & Computat Biol, Boston, MA 02215 USA
[2] Harvard Sch Publ Hlth, Dept Biostat, Boston, MA 02215 USA
[3] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA 02215 USA
[4] Harvard Med Sch, Boston, MA 02215 USA
[5] Dana Farber Canc Inst, Dept Canc Biol, Boston, MA 02215 USA
[6] Genentech Inc, Dept Biostat, Prod Dev, 1 DNA Way, San Francisco, CA 94080 USA
Funding
US National Institutes of Health;
Keywords
GTEx; RNA-Seq; Quality control; Filtering; Preprocessing; Normalization; Differential expression analysis;
DOI
10.1186/s12859-017-1847-x
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Background: Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data, which are critical first steps for any subsequent analysis. Results: We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed the Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project. Conclusions: An R package instantiating YARN is available at http://bioconductor.org/packages/yarn.
Pages: 10
Related papers
27 in total
[1]   Differential expression analysis for sequence count data [J].
Anders, Simon ;
Huber, Wolfgang .
GENOME BIOLOGY, 2010, 11 (10)
[2]   The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans [J].
Ardlie, Kristin G. ;
DeLuca, David S. ;
Segre, Ayellet V. ;
Sullivan, Timothy J. ;
Young, Taylor R. ;
Gelfand, Ellen T. ;
Trowbridge, Casandra A. ;
Maller, Julian B. ;
Tukiainen, Taru ;
Lek, Monkol ;
Ward, Lucas D. ;
Kheradpour, Pouya ;
Iriarte, Benjamin ;
Meng, Yan ;
Palmer, Cameron D. ;
Esko, Tonu ;
Winckler, Wendy ;
Hirschhorn, Joel N. ;
Kellis, Manolis ;
MacArthur, Daniel G. ;
Getz, Gad ;
Shabalin, Andrey A. ;
Li, Gen ;
Zhou, Yi-Hui ;
Nobel, Andrew B. ;
Rusyn, Ivan ;
Wright, Fred A. ;
Lappalainen, Tuuli ;
Ferreira, Pedro G. ;
Ongen, Halit ;
Rivas, Manuel A. ;
Battle, Alexis ;
Mostafavi, Sara ;
Monlong, Jean ;
Sammeth, Michael ;
Mele, Marta ;
Reverter, Ferran ;
Goldmann, Jakob M. ;
Koller, Daphne ;
Guigo, Roderic ;
McCarthy, Mark I. ;
Dermitzakis, Emmanouil T. ;
Gamazon, Eric R. ;
Im, Hae Kyung ;
Konkashbaev, Anuar ;
Nicolae, Dan L. ;
Cox, Nancy J. ;
Flutre, Timothee ;
Wen, Xiaoquan ;
Stephens, Matthew .
SCIENCE, 2015, 348 (6235) :648-660
[3]   A comparison of normalization methods for high density oligonucleotide array data based on variance and bias [J].
Bolstad, BM ;
Irizarry, RA ;
Åstrand, M ;
Speed, TP .
BIOINFORMATICS, 2003, 19 (02) :185-193
[4]   Independent filtering increases detection power for high-throughput experiments [J].
Bourgon, Richard ;
Gentleman, Robert ;
Huber, Wolfgang .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2010, 107 (21) :9546-9551
[5]   Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments [J].
Bullard, James H. ;
Purdom, Elizabeth ;
Hansen, Kasper D. ;
Dudoit, Sandrine .
BMC BIOINFORMATICS, 2010, 11
[6]  
Chen C-Y, 2016, BIORXIV, P82289
[7]   Comprehensive genomic characterization defines human glioblastoma genes and core pathways [J].
Chin, L. ;
Meyerson, M. ;
Aldape, K. ;
Bigner, D. ;
Mikkelsen, T. ;
VandenBerg, S. ;
Kahn, A. ;
Penny, R. ;
Ferguson, M. L. ;
Gerhard, D. S. ;
Getz, G. ;
Brennan, C. ;
Taylor, B. S. ;
Winckler, W. ;
Park, P. ;
Ladanyi, M. ;
Hoadley, K. A. ;
Verhaak, R. G. W. ;
Hayes, D. N. ;
Spellman, Paul T. ;
Absher, D. ;
Weir, B. A. ;
Ding, L. ;
Wheeler, D. ;
Lawrence, M. S. ;
Cibulskis, K. ;
Mardis, E. ;
Zhang, Jinghui ;
Wilson, R. K. ;
Donehower, L. ;
Wheeler, D. A. ;
Purdom, E. ;
Wallis, J. ;
Laird, P. W. ;
Herman, J. G. ;
Schuebel, K. E. ;
Weisenberger, D. J. ;
Baylin, S. B. ;
Schultz, N. ;
Yao, Jun ;
Wiedemeyer, R. ;
Weinstein, J. ;
Sander, C. ;
Gibbs, R. A. ;
Gray, J. ;
Kucherlapati, R. ;
Lander, E. S. ;
Myers, R. M. ;
Perou, C. M. ;
McLendon, Roger .
NATURE, 2008, 455 (7216) :1061-1068
[8]   A survey of best practices for RNA-seq data analysis [J].
Conesa, Ana ;
Madrigal, Pedro ;
Tarazona, Sonia ;
Gomez-Cabrero, David ;
Cervera, Alejandra ;
McPherson, Andrew ;
Szczesniak, Michal Wojciech ;
Gaffney, Daniel J. ;
Elo, Laura L. ;
Zhang, Xuegong ;
Mortazavi, Ali .
GENOME BIOLOGY, 2016, 17
[9]   Human housekeeping genes, revisited [J].
Eisenberg, Eli ;
Levanon, Erez Y. .
TRENDS IN GENETICS, 2013, 29 (10) :569-574
[10]   Exploring regulation in tissues with eQTL networks [J].
Fagny, Maud ;
Paulson, Joseph N. ;
Kuijjer, Marieke L. ;
Sonawane, Abhijeet R. ;
Chen, Cho-Yi ;
Lopes-Ramos, Camila M. ;
Glass, Kimberly ;
Quackenbush, John ;
Platig, John .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2017, 114 (37) :E7841-E7850