Cross validation for model selection: A review with examples from ecology

Cited by: 94
Authors
Yates, Luke A. [1 ]
Aandahl, Zach [1 ]
Richards, Shane A. [1 ]
Brook, Barry W. [1 ]
Affiliations
[1] Univ Tasmania, Sch Nat Sci, Hobart, Tas, Australia
Funding
Australian Research Council
Keywords
cross validation; information theory; model selection; overfitting; parsimony; post-selection inference; INFERENCE; INFORMATION; PREDICTION; REGRESSION; REGULARIZATION; GROWTH
DOI
10.1002/ecm.1557
Chinese Library Classification
Q14 [Ecology (Bio-ecology)]
Discipline classification codes
071012; 0713
Abstract
Specifying, assessing, and selecting among candidate statistical models is fundamental to ecological research. Commonly used approaches to model selection are based on predictive scores and include information criteria, such as Akaike's information criterion, and cross validation. Based on data splitting, cross validation is particularly versatile because it can be used even when it is not possible to derive a likelihood (e.g., many forms of machine learning) or to count parameters precisely (e.g., mixed-effects models). However, much of the literature on cross validation is technical and spread across statistical journals, making it difficult for ecological analysts to assess and choose among the wide range of options. Here we provide a comprehensive, accessible review that explains important but often overlooked technical aspects of cross validation for model selection, such as bias correction, estimation uncertainty, choice of scores, and selection rules to mitigate overfitting. We synthesize the relevant statistical advances to make recommendations for the choice of cross-validation technique, and we present two ecological case studies to illustrate their application. In most instances, we recommend using exact or approximate leave-one-out cross validation to minimize bias, or otherwise k-fold cross validation with bias correction if k < 10. To mitigate overfitting when using cross validation, we recommend calibrated selection via our recently introduced modified one-standard-error rule. We advocate for the use of predictive scores in model selection across a range of typical modeling goals, such as exploration, hypothesis testing, and prediction, provided that models are specified in accordance with the stated goal. We also emphasize, as others have done, that inference on parameter estimates is biased if preceded by model selection and instead requires a carefully specified single model or further technical adjustments.
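To make the abstract's recommendations concrete, the following minimal sketch (Python, NumPy only) illustrates k-fold cross validation over a set of nested candidate models together with a one-standard-error-style selection rule. The simulated polynomial example, the fold count, and the simple standard-error threshold are illustrative assumptions; they do not reproduce the authors' modified one-standard-error rule, which adds a calibration step described in the paper.

    # Minimal sketch: k-fold cross validation for model selection plus a
    # one-standard-error-style rule. Example data and threshold are
    # illustrative assumptions, not the authors' implementation.
    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data: a cubic signal plus noise (hypothetical example).
    n = 120
    x = rng.uniform(-2, 2, n)
    y = 0.5 * x**3 - x + rng.normal(0, 1.0, n)

    def kfold_scores(x, y, degree, k=10, rng=rng):
        """Mean squared prediction error on each of k held-out folds."""
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, k)
        scores = []
        for test in folds:
            train = np.setdiff1d(idx, test)
            coef = np.polyfit(x[train], y[train], degree)  # fit on training folds
            pred = np.polyval(coef, x[test])               # predict held-out fold
            scores.append(np.mean((y[test] - pred) ** 2))
        return np.array(scores)

    degrees = range(1, 8)                    # candidate models of rising complexity
    cv = {d: kfold_scores(x, y, d) for d in degrees}
    mean = {d: s.mean() for d, s in cv.items()}
    se = {d: s.std(ddof=1) / np.sqrt(len(s)) for d, s in cv.items()}

    best = min(mean, key=mean.get)           # model with the lowest CV score
    # One-SE-style rule: simplest model within one standard error of the best.
    threshold = mean[best] + se[best]
    chosen = min(d for d in degrees if mean[d] <= threshold)

    print(f"lowest-score model: degree {best}; selected by 1-SE rule: degree {chosen}")

The one-standard-error step encodes the parsimony argument made in the paper: when several models predict about equally well, preferring the simplest of them guards against overfitting driven by sampling noise in the cross-validation scores.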
Pages: 24