Optimality of training/test size and resampling effectiveness in cross-validation

Cited by: 41
Authors
Afendras, Georgios [1,2]
Markatou, Marianthi [1 ]
Affiliations
[1] SUNY Buffalo, Dept Biostat, Buffalo, NY 14260 USA
[2] SUNY Buffalo, Jacobs Sch Med & Biomed Sci, Buffalo, NY USA
Keywords
Cross-validation estimator; Generalization error; Optimality; Resampling effectiveness; Training sample size; ERROR RATE; STATISTICAL COMPARISONS; MODEL SELECTION; CLASSIFICATION; CLASSIFIERS; ESTIMATORS; VARIANCE; SET;
DOI
10.1016/j.jspi.2018.07.005
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
An important question in cross-validation (CV) is whether rules can be established that allow optimal selection of the training/test sample sizes, for fixed values of the total sample size n. We study the cases of repeated train-test CV and k-fold CV for certain decision rules that are used frequently. We begin by defining the resampling effectiveness of repeated train-test CV estimators of the generalization error and study its relation to optimal training sample size selection. We then define optimality via simple statistical rules that allow us to select the optimal training sample size and the optimal number of folds. We show that: (1) there exist decision rules for which closed-form solutions of the optimal training/test sample size can be obtained; (2) for a broad class of loss functions, the optimal training sample size equals half of the total sample size, independently of the data distribution and the data analytic task. We study optimal selection of the number of folds in k-fold CV and address the case of classification via logistic regression and support vector machines, substantiating our claims theoretically and empirically in both small and large sample sizes. We contrast our results with standard practice in the use of CV. (C) 2018 Elsevier B.V. All rights reserved.
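A minimal sketch of a repeated train-test CV estimator of the generalization error, with the training set fixed at half of the total sample size (the split the abstract identifies as optimal for a broad class of loss functions). This is an illustration, not the authors' code; it assumes scikit-learn, and the synthetic data, the logistic-regression classifier, and the number of repeated splits are arbitrary choices made for the example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic classification data; n_samples is the total sample size n.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Repeated train-test CV: J random splits, each with n_train = n/2.
splits = ShuffleSplit(n_splits=100, train_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# cross_val_score returns the test-set accuracy of each split; the CV
# estimate of the generalization (misclassification) error averages
# 1 - accuracy over the repeated splits.
test_errors = 1.0 - cross_val_score(clf, X, y, cv=splits)
print(f"repeated train-test CV error (n_train = n/2): {test_errors.mean():.3f}")

Varying train_size (e.g., 0.25, 0.5, 0.75) in the same setup is one simple way to compare the half/half split against other training/test allocations empirically.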
Pages: 286-301
Number of pages: 16