Estimators of the local false discovery rate designed for small numbers of tests

Cited by: 15

Authors:

Padilla, Marta [1]

Bickel, David R. [1]

Affiliations:

[1] Univ Ottawa, Ottawa Inst Syst Biol, Dept Biochem Microbiol & Immunol, Ottawa, ON K1N 6N5, Canada

Source:

STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY | 2012 / Vol. 11 / Issue 05

Funding:

Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada

Keywords:

empirical Bayes; local false discovery rate; medium-dimensional biology; medium-scale inference; minimum description length; penalized likelihood; reduced likelihood; selection bias; small-dimensional biology; small-scale inference; Type II maximum likelihood; DIFFERENTIAL GENE-EXPRESSION; EMPIRICAL BAYES METHODS; CONFIDENCE; BIOCONDUCTOR; SIZE;

DOI:

10.1515/1544-6115.1807

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular biology]

Subject classification code:

071010; 081704

Abstract:

Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (d(alt)), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. 
For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different d(alt) values and over various values of p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3 and N = 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain, since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3 and N = 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and d(alt).
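Two ideas in the abstract can be illustrated with a minimal sketch: the LFDR as a posterior probability that the null hypothesis is true, and the worst-case criterion defined as the maximum absolute bias over the simulated settings. The sketch below is not the authors' implementation. It assumes a simple two-group model in which the test statistic is N(0, 1) under the null and N(d_alt, 1) under the alternative, with p0 treated as known, whereas the paper's estimators estimate the prior from the data. The function names, the estimator labels, and the bias values are all hypothetical placeholders, not results from the paper.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lfdr_two_group(x, p0, d_alt):
    """Posterior probability that the null is true, given statistic x.

    Assumed model (an illustration, not the paper's): x ~ N(0, 1) under
    the null, x ~ N(d_alt, 1) under the alternative, and a prior
    probability p0 that a feature is unaffected.
    """
    null_part = p0 * normal_pdf(x, 0.0)
    alt_part = (1.0 - p0) * normal_pdf(x, d_alt)
    return null_part / (null_part + alt_part)


def worst_case_abs_bias(bias_by_setting):
    """Maximum absolute bias over all (d_alt, p0) settings."""
    return max(abs(b) for b in bias_by_setting.values())


# Placeholder biases keyed by (d_alt, p0). These numbers are invented
# purely to show the selection rule; they are not simulation results.
hypothetical_bias = {
    "estimator_A": {(1.0, 0.5): -0.20, (1.0, 0.9): -0.02,
                    (3.0, 0.5): -0.10, (3.0, 0.9): 0.01},
    "estimator_B": {(1.0, 0.5): -0.08, (1.0, 0.9): 0.12,
                    (3.0, 0.5): -0.05, (3.0, 0.9): 0.09},
}

# Worst-case absolute bias per estimator, then the "safest" estimator,
# meaning the one whose worst case is smallest.
worst = {name: worst_case_abs_bias(b) for name, b in hypothetical_bias.items()}
safest = min(worst, key=worst.get)

if __name__ == "__main__":
    print(lfdr_two_group(x=2.5, p0=0.9, d_alt=3.0))
    print(worst, safest)
```

In this toy setup a statistic lying exactly halfway between the null and alternative means, with p0 = 0.5, gives an LFDR of 0.5, as both densities are then equal. Restricting the dictionary to settings with high p0 before taking the maximum would mimic the abstract's distinction between the case where a high proportion of unaffected features can be assumed and the case where p0 is uncertain.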

Pages: 42
