Cross-Validation Visualized: A Narrative Guide to Advanced Methods

Cited by: 42
Authors
Allgaier, Johannes [1, 2]
Pryss, Ruediger [1, 2]
Affiliations
[1] Univ Hosp Wurzburg, Inst Med Data Sci, D-97080 Wurzburg, Germany
[2] Julius Maximilians Univ Wurzburg, Inst Clin Epidemiol & Biometry, D-97080 Wurzburg, Germany
Keywords
train test split; cross-validation; grouped cross-validation; stratified cross-validation; time-based cross-validation; block cross-validation
DOI
10.3390/make6020065
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This study delves into the multifaceted nature of cross-validation (CV) techniques in machine learning model evaluation and selection, underscoring the challenge of choosing the most appropriate method due to the plethora of available variants. It aims to clarify and standardize terminology such as sets, groups, folds, and samples pivotal in the CV domain, and introduces an exhaustive compilation of advanced CV methods like leave-one-out, leave-p-out, Monte Carlo, grouped, stratified, and time-split CV within a hold-out CV framework. Through graphical representations, the paper enhances the comprehension of these methodologies, facilitating more informed decision making for practitioners. It further explores the synergy between different CV strategies and advocates for a unified approach to reporting model performance by consolidating essential metrics. The paper culminates in a comprehensive overview of the CV techniques discussed, illustrated with practical examples, offering valuable insights for both novice and experienced researchers in the field.
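Several of the splitting schemes named in the abstract can be illustrated with a minimal, dependency-free Python sketch. The function names (`leave_p_out`, `grouped_k_fold`, `expanding_time_split`) are hypothetical and are not taken from the paper; the round-robin assignment of groups to folds and the equal-sized time blocks are simplifying assumptions made only for this illustration.

```python
from itertools import combinations


def leave_p_out(n_samples, p):
    """Yield (train, test) index lists for every possible test set of size p."""
    indices = range(n_samples)
    for test in combinations(indices, p):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, list(test)


def grouped_k_fold(groups, k):
    """Yield (train, test) index lists so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    for fold in range(k):
        # Assumption: whole groups are assigned to folds round-robin.
        test_groups = set(unique_groups[fold::k])
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test


def expanding_time_split(n_samples, n_splits):
    """Yield (train, test) index lists where training data always precedes test data."""
    # Assumption: samples are already in time order and n_samples >= n_splits + 1.
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        test = list(range(i * block, (i + 1) * block))
        yield train, test


if __name__ == "__main__":
    # Leave-one-out is the special case p = 1.
    for train, test in leave_p_out(3, 1):
        print("LOO   train", train, "test", test)
    for train, test in grouped_k_fold(["a", "a", "b", "b", "c", "c"], 3):
        print("Group train", train, "test", test)
    for train, test in expanding_time_split(10, 4):
        print("Time  train", train, "test", test)
```

In this sketch, leave-p-out enumerates every test subset and therefore grows combinatorially with the sample count, the grouped split keeps all samples of one group together, and the time-based split never trains on samples that come after the test block.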
Pages: 1378-1388
Page count: 11
Related papers
20 in total
[1]  
Baier L., 2019, ECIS, V1
[2]   Reconciling modern machine-learning practice and the classical bias-variance trade-off [J].
Belkin, Mikhail ;
Hsu, Daniel ;
Ma, Siyuan ;
Mandal, Soumik .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2019, 116 (32) :15849-15854
[3]   On the use of cross-validation for time series predictor evaluation [J].
Bergmeir, Christoph ;
Benitez, Jose M. .
INFORMATION SCIENCES, 2012, 191 :192-213
[4]  
Berrar D., 2018, Encyclopedia of Bioinformatics and Computational Biology, V1st, P542, DOI 10.1016/B978-0-12-809633-8.20349-X
[5]   Nonparametric density estimation by exact leave-p-out cross-validation [J].
Celisse, Alain ;
Robin, Stephane .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2008, 52 (05) :2350-2368
[6]  
Chapman P., 2000, Tech. Rep.
[7]  
Dubitzky W., 2007, FUNDAMENTALS DATA MI
[8]  
Hart P.E., 2000, Pattern Classification, DOI 10.5555/954544
[9]  
Hastie T., 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, V2, P1
[10]   The problem of overfitting [J].
Hawkins, DM .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (01) :1-12