ComBat-seq: batch effect adjustment for RNA-seq count data

Citations: 774
Authors
Zhang, Yuqing [1]
Parmigiani, Giovanni [2, 3]
Johnson, W. Evan [4, 5, 6]
Affiliations
[1] Gilead Sci Inc, Dept Bioinformat & Clin Data Sci, 333 Lakeside Dr, Foster City, CA 94404 USA
[2] Dana Farber Canc Inst, Dept Data Sci, 450 Brookline Ave, Boston, MA 02215 USA
[3] Harvard TH Chan Sch Publ Hlth, Dept Biostat, 677 Huntington Ave, Boston, MA 02115 USA
[4] Boston Univ, Sch Med, Div Computat Biomed, 72 East Concord St, Boston, MA 02118 USA
[5] Boston Univ, Grad Program Bioinformat, 24 Cummington Mall, Boston, MA 02215 USA
[6] Boston Univ, Sch Publ Hlth, Dept Biostat, 715 Albany St, Boston, MA 02118 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
DIFFERENTIAL EXPRESSION ANALYSIS; PACKAGE;
DOI
10.1093/nargab/lqaa078
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
The benefit of integrating batches of genomic data to increase statistical power is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to effectively address batch effects in genomic data to overcome these challenges. Many existing methods for batch effect adjustment assume the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that ComBat-seq adjusted data result in better statistical power and control of false positives in differential expression compared to data adjusted by other available methods. We further demonstrate in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
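To make the idea of an integer-preserving adjustment concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a count observed under a batch-affected negative binomial distribution can be adjusted by mapping its quantile onto a batch-free negative binomial distribution. The function names (nb_pmf, nb_cdf, nb_quantile, quantile_map) and all parameter values are hypothetical and chosen only for illustration; in practice the means and dispersions would be estimated from a fitted regression model.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi > 0.

    Parameterized so that variance = mu + phi * mu**2 (over-dispersed counts).
    """
    r = 1.0 / phi            # size parameter
    p = r / (r + mu)         # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def nb_cdf(k, mu, phi):
    """P(X <= k), accumulated term by term."""
    total = 0.0
    for j in range(k + 1):
        total += nb_pmf(j, mu, phi)
    return total


def nb_quantile(q, mu, phi, max_count=1_000_000):
    """Smallest integer k with P(X <= k) >= q (small tolerance for rounding)."""
    total = 0.0
    for k in range(max_count):
        total += nb_pmf(k, mu, phi)
        if total >= q - 1e-12:
            return k
    return max_count


def quantile_map(count, mu_batch, phi_batch, mu_free, phi_free):
    """Map an observed count to the same quantile of a batch-free distribution.

    Assumption for this sketch: the four parameters are already known.
    The result is always an integer, so it stays usable as count data.
    """
    q = nb_cdf(count, mu_batch, phi_batch)
    return nb_quantile(q, mu_free, phi_free)


# Hypothetical example: a batch doubles the expected count of one gene.
adjusted = quantile_map(30, mu_batch=40.0, phi_batch=0.2,
                        mu_free=20.0, phi_free=0.2)
print(adjusted)
```

Because the output of quantile_map is an integer, it can be passed to tools that reject non-integer input, which is the compatibility property the abstract emphasizes. A Gaussian-based adjustment would instead generally produce real-valued, possibly negative, results.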
Pages: 10