Large-Scale Simultaneous Testing Using Kernel Density Estimation

Authors
Santu Ghosh
Alan M. Polansky
Affiliations
[1] Augusta University
[2] Northern Illinois University
Source
Sankhya A | 2022 / Volume 84 / Issue 2
Keywords
Two-sample t-test; Kernel density estimator; Edgeworth expansion; False discovery rate; Primary 62F03; Secondary 62G10
Abstract
When Student's t-statistic was introduced a century ago, few could have anticipated how widely it would be applied in the modern era. It is now used in large-scale multiple hypothesis testing, feature selection and ranking, high-dimensional signal detection, and related problems. Student's t-statistic is constructed from the empirical distribution function (EDF). An alternative to the EDF is the kernel density estimate (KDE), which is a smoothed version of the EDF. The novelty of this work is an alternative to Student's t-test that uses the KDE technique, together with an exploration of the usefulness of the KDE-based t-test in large-scale simultaneous hypothesis testing. An optimal bandwidth parameter for the KDE approach is derived by minimizing the asymptotic error between the true p-value and its asymptotic estimate based on the normal approximation. When the KDE-based approach is used for large-scale simultaneous testing, a natural question is when the method fails to control the error rate. We show that the suggested KDE-based method can control the false discovery rate (FDR) if the total number of tests diverges at a smaller order of magnitude than $N^{3/2}$, where $N$ is the total sample size. We compare our method with several alternatives with respect to FDR. Simulations show that our method produces a lower proportion of false discoveries than its competitors; that is, it controls the FDR better than they do. These empirical studies indicate that the proposed method can be applied successfully in practice. The usefulness of the proposed method is further illustrated with a gene expression data example.
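For readers unfamiliar with FDR control in large-scale simultaneous testing, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to a list of p-values. It is an illustration only: the abstract does not state which FDR procedure the authors use, the sketch does not implement their KDE-based t-statistic or bandwidth choice, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Assumption: the p-values come from independent (or positively
    dependent) tests, e.g. one two-sample test per gene.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * alpha; 0 if none.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Step-up rule: reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per simultaneous test.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example_p, alpha=0.05)
```

In this hypothetical example only the two smallest p-values fall below their rank-dependent thresholds, so two hypotheses are rejected. Any method that produces one p-value per test, whether a standard t-test or a KDE-based variant, could supply the input list.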
Pages: 808-843
Page count: 35