Correcting the Optimal Resampling-Based Error Rate by Estimating the Error Rate of Wrapper Algorithms

Cited by: 15
Authors
Bernau, Christoph [1 ]
Augustin, Thomas [2 ]
Boulesteix, Anne-Laure [1 ]
Affiliations
[1] Dept Med Informat Biometry & Epidemiol, D-81377 Munich, Germany
[2] Univ Munich, Dept Stat, D-80539 Munich, Germany
Keywords
Classification; High-dimensional data; Method selection bias; Repeated subsampling; Tuning bias
DOI
10.1111/biom.12041
Chinese Library Classification
Q [Biological Sciences]
Subject Classification
07; 0710; 09
Abstract
High-dimensional binary classification tasks, for example, the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter. Reporting only the performance of the best tuning parameter value yields over-optimistic prediction error estimates. To correct this tuning bias, we develop a new method based on a decomposition of the unconditional error rate involving the tuning procedure; that is, we estimate the error rate of wrapper algorithms as introduced in the context of internal cross-validation (ICV) by Varma and Simon (2006, BMC Bioinformatics 7, 91). Our subsampling-based estimator can be written as a weighted mean of the errors obtained using the different tuning parameter values, and thus can be interpreted as a smooth version of ICV, which is the standard approach for avoiding tuning bias. In contrast to ICV, our method guarantees intuitive bounds for the corrected error. Additionally, we suggest using bias correction methods also to address the conceptually similar method selection bias that results from the optimal choice of the classification method itself when evaluating several methods successively. We demonstrate the performance of our method on microarray and simulated data and compare it to ICV. This study suggests that our approach yields competitive estimates at a much lower computational price.
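The core idea in the abstract can be illustrated with a small sketch: across subsampling iterations, picking the minimum observed error over tuning parameter values is optimistically biased, whereas a weighted mean of the per-value errors stays within intuitive bounds. This is a simplified toy illustration, not the paper's exact estimator; the weighting scheme (selection frequency across subsampling iterations) and the simulated error rates are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: error rates from B subsampling iterations for each of
# K candidate tuning parameter values (rows: iterations, cols: values).
B, K = 50, 5
true_err = np.array([0.30, 0.25, 0.22, 0.26, 0.33])  # assumed per-value true errors
errors = np.clip(true_err + rng.normal(0.0, 0.05, size=(B, K)), 0.0, 1.0)

mean_err = errors.mean(axis=0)  # estimated error per tuning parameter value

# Naive estimate: report only the best observed mean error (tuning bias).
naive = mean_err.min()

# Smoothed correction: a weighted mean of the per-value errors, weighted by
# how often each value "wins" across subsampling iterations (illustrative
# weighting only, not the paper's exact formula).
wins = np.bincount(errors.argmin(axis=1), minlength=K)
weights = wins / wins.sum()
corrected = weights @ mean_err

# By construction the corrected estimate lies between the smallest and
# largest per-value error -- the intuitive bounds the abstract refers to.
assert mean_err.min() <= corrected <= mean_err.max()
print(f"naive: {naive:.3f}, corrected: {corrected:.3f}")
```

Because the weights are non-negative and sum to one, the corrected estimate can never fall below the minimum per-value error, unlike some ICV corrections that may leave the plausible range.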
Pages: 693-702
Page count: 10
References
10 records in total
[1] Anonymous, 2009, CMA SYNTHESIS MICROA.
[2] Boulesteix, A.-L., & Strobl, C. (2009). Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Medical Research Methodology, 9.
[3] Dupuy, A., & Simon, R. M. (2007). Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. JNCI-Journal of the National Cancer Institute, 99(2), 147-157.
[4] Genz, A., 2021, MVTNORM MULTIVARIATE.
[5] Hanczar, B., Hua, J., & Dougherty, E. R. (2007). Decorrelation of the true and estimated classifier errors in high-dimensional settings. EURASIP Journal on Bioinformatics and Systems Biology, 2007(1).
[6] Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and Its Applications, 103, 103-118.
[7] Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K., & Boulesteix, A.-L. (2010). Over-optimism in bioinformatics: an illustration. Bioinformatics, 26(16), 1990-1998.
[8] Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239-281.
[9] Tibshirani, R. J., & Tibshirani, R. (2009). A bias correction for the minimum error rate in cross-validation. Annals of Applied Statistics, 3(2), 822-829.
[10] Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7(1), 91.