Cross validation for model selection: A review with examples from ecology

Cited by: 94

Authors:

Yates, Luke A. [1]

Aandahl, Zach [1]

Richards, Shane A. [1]

Brook, Barry W. [1]

Affiliations:

[1] Univ Tasmania, Sch Nat Sci, Hobart, Tas, Australia

Source:

ECOLOGICAL MONOGRAPHS | 2023 / Vol. 93 / Issue 01

Funding:

Australian Research Council;

Keywords:

cross validation; information theory; model selection; overfitting; parsimony; post-selection inference; INFERENCE; INFORMATION; PREDICTION; REGRESSION; REGULARIZATION; GROWTH;

DOI:

10.1002/ecm.1557

Chinese Library Classification number:

Q14 [Ecology (bioecology)];

Discipline classification codes:

071012 ; 0713 ;

Abstract:

Specifying, assessing, and selecting among candidate statistical models is fundamental to ecological research. Commonly used approaches to model selection are based on predictive scores and include information criteria, such as Akaike's information criterion, and cross validation. Based on data splitting, cross validation is particularly versatile because it can be used even when it is not possible to derive a likelihood (e.g., many forms of machine learning) or count parameters precisely (e.g., mixed-effects models). However, much of the literature on cross validation is technical and spread across statistical journals, making it difficult for ecological analysts to assess and choose among the wide range of options. Here we provide a comprehensive, accessible review that explains important, but often overlooked, technical aspects of cross validation for model selection, such as bias correction, estimation uncertainty, choice of scores, and selection rules to mitigate overfitting. We synthesize the relevant statistical advances to make recommendations for the choice of cross-validation technique, and we present two ecological case studies to illustrate their application. In most instances, we recommend using exact or approximate leave-one-out cross validation to minimize bias, or otherwise k-fold with bias correction if k < 10. To mitigate overfitting when using cross validation, we recommend calibrated selection via our recently introduced modified one-standard-error rule. We advocate for the use of predictive scores in model selection across a range of typical modeling goals, such as exploration, hypothesis testing, and prediction, provided that models are specified in accordance with the stated goal. We also emphasize, as others have done, that inference on parameter estimates is biased if preceded by model selection and instead requires a carefully specified single model or further technical adjustments.
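The workflow the abstract recommends (leave-one-out scores plus a parsimony-favoring selection rule) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' code: the candidate models are polynomial regressions fitted by ordinary least squares, the score is squared error, the data are simulated, and the helper names (`design_matrix`, `loo_squared_errors`, `select_model`) are hypothetical. The selection step is a one-standard-error-style rule based on paired score differences from the best-scoring model; this record does not spell out the paper's modified rule, so the sketch should not be read as its exact definition.

```python
import numpy as np

def design_matrix(x, degree):
    # Columns 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def loo_squared_errors(X, y):
    """Exact leave-one-out squared errors for ordinary least squares.

    Uses the identity e_loo_i = e_i / (1 - h_ii), where h_ii is the i-th
    leverage (diagonal of the hat matrix), so no refitting is needed.
    """
    Q, _ = np.linalg.qr(X)                # thin QR; hat matrix is Q @ Q.T
    leverage = np.sum(Q**2, axis=1)
    residuals = y - Q @ (Q.T @ y)
    return (residuals / (1.0 - leverage)) ** 2

def select_model(errors_by_model):
    """Pick the simplest model within one standard error of the best.

    errors_by_model: pointwise LOO losses, ordered simplest to most complex.
    The standard error is that of the paired loss difference from the
    best-scoring model (an assumption of this sketch).
    """
    scores = np.array([e.mean() for e in errors_by_model])
    best = int(np.argmin(scores))
    n = len(errors_by_model[best])
    for m, errors in enumerate(errors_by_model):
        diff = errors - errors_by_model[best]
        se_diff = diff.std(ddof=1) / np.sqrt(n)
        if diff.mean() <= se_diff:
            return m, best, scores
    return best, best, scores

# Simulated data (hypothetical): a quadratic signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=60)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=1.0, size=x.size)

degrees = [0, 1, 2, 3, 4, 5]
errors_by_model = [loo_squared_errors(design_matrix(x, d), y) for d in degrees]
selected, best, scores = select_model(errors_by_model)

print("LOO mean squared error by degree:", np.round(scores, 3))
print("best-scoring degree:", degrees[best])
print("selected degree:", degrees[selected])
```

The leverage identity gives exact leave-one-out errors only for least-squares fits; other model classes would need refitting or an approximate method. Because the loop runs from the simplest candidate upward, the selected model is never more complex than the best-scoring one.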

Citations

Pages: 24

50 items in total

[1] On Estimating Model in Feature Selection With Cross-Validation
Qi, Chunxia
Diao, Jiandong
Qiu, Like
IEEE ACCESS, 2019, 7 : 33454 - 33463
[2] Cross-validation for selecting a model selection procedure
Zhang, Yongli
Yang, Yuhong
JOURNAL OF ECONOMETRICS, 2015, 187 (01) : 95 - 112
[3] Weighted cross validation in model selection
Markatou, Marianthi
Afendras, Georgios
Agostinelli, Claudio
WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2018, 10 (06):
[4] Linear model selection by cross-validation
Rao, CR
Wu, Y
JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2005, 128 (01) : 231 - 240
[5] A survey of cross-validation procedures for model selection
Arlot, Sylvain
Celisse, Alain
STATISTICS SURVEYS, 2010, 4 : 40 - 79
[6] Evaluation of BIC and Cross Validation for model selection on sequence segmentations
Haiminen, Niina
Mannila, Heikki
INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 2010, 4 (06) : 675 - 700
[7] Empirical Comparison Between Cross-Validation and Mutation-Validation in Model Selection
Yu, Jinyang
Hamdan, Sami
Sasse, Leonard
Morrison, Abigail
Patil, Kaustubh R.
ADVANCES IN INTELLIGENT DATA ANALYSIS XXII, PT II, IDA 2024, 2024, 14642 : 56 - 67
[8] Consistent cross-validatory model-selection for dependent data: hv-block cross-validation
Racine, J
JOURNAL OF ECONOMETRICS, 2000, 99 (01) : 39 - 61
[9] Managing the computational cost of model selection and cross-validation in extreme learning machines via Cholesky, SVD, QR and eigen decompositions
Kokkinos, Yiannis
Margaritis, Konstantinos G.
NEUROCOMPUTING, 2018, 295 : 29 - 45
[10] MODEL SELECTION VIA MULTIFOLD CROSS-VALIDATION
ZHANG, P
ANNALS OF STATISTICS, 1993, 21 (01) : 299 - 313
