Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences

Cited by: 1101
Authors
Zhu, Anqi [1]
Ibrahim, Joseph G. [1]
Love, Michael I. [1, 2]
Affiliations
[1] Univ N Carolina, Dept Biostat, Chapel Hill, NC 27599 USA
[2] Univ N Carolina, Dept Genet, Chapel Hill, NC 27599 USA
Keywords
RNA-SEQ EXPERIMENTS; EXPRESSION ANALYSIS; GENE;
DOI
10.1093/bioinformatics/bty895
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across conditions, despite technical and biological variability in the observations. A common task is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC).
Results: When the read counts are low or highly variable, the maximum likelihood estimates for the LFCs have high variance, leading to large estimates that are not representative of true differences, and to poor ranking of genes by effect size. One approach is to introduce filtering thresholds and pseudocounts to exclude or moderate estimated LFCs. Filtering may remove genes with true differences in expression from the analysis, while pseudocounts provide a limited solution that must be adapted per dataset. Here, we propose the use of a heavy-tailed Cauchy prior distribution for effect sizes, which avoids the use of filter thresholds or pseudocounts. The proposed method, Approximate Posterior Estimation for generalized linear model, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference.
Availability and implementation: The apeglm package is available as an R/Bioconductor package at https://bioconductor.org/packages/apeglm, and the methods can be called from within the DESeq2 software.
Supplementary information: Supplementary data are available at Bioinformatics online.
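The contrast the abstract draws between a heavy-tailed prior and a lighter-tailed one can be illustrated with a toy calculation. The sketch below is not the apeglm implementation: it assumes a normal approximation to the likelihood of a single LFC (an estimate `lfc_hat` with standard error `se`) rather than a count-data generalized linear model, fixes the prior scale at 1, and uses hypothetical function names. It compares the posterior mode under a Normal prior with the posterior mode under a Cauchy prior.

```python
import math


def map_normal_prior(lfc_hat, se, prior_sd=1.0):
    """Posterior mode under a Normal(0, prior_sd^2) prior (closed form).

    Assumption: the likelihood is approximated as Normal(lfc_hat, se^2).
    """
    return lfc_hat * prior_sd ** 2 / (prior_sd ** 2 + se ** 2)


def map_cauchy_prior(lfc_hat, se, prior_scale=1.0, n_grid=20001):
    """Posterior mode under a Cauchy(0, prior_scale) prior, by grid search.

    Assumption: same normal approximation to the likelihood as above.
    The mode lies between 0 and lfc_hat, so only that interval is searched.
    """
    def log_posterior(b):
        log_lik = -0.5 * ((b - lfc_hat) / se) ** 2
        log_prior = -math.log1p((b / prior_scale) ** 2)
        return log_lik + log_prior

    grid = [lfc_hat * i / (n_grid - 1) for i in range(n_grid)]
    return max(grid, key=log_posterior)


# One precisely estimated large effect, one noisy small effect (made-up values).
for lfc_hat, se in [(5.0, 0.5), (1.0, 2.0)]:
    print(
        f"lfc_hat={lfc_hat}, se={se}: "
        f"normal prior -> {map_normal_prior(lfc_hat, se):.2f}, "
        f"Cauchy prior -> {map_cauchy_prior(lfc_hat, se):.2f}"
    )
```

Under these assumptions, the well-estimated large effect is pulled from 5 down to 4 by the Normal prior but stays near 4.9 under the Cauchy prior, while the noisy estimate is shrunk strongly toward zero by both. This is the qualitative behavior the abstract describes: noise is reduced while large differences are preserved.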
Pages: 2084-2092
Page count: 9