What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?

Cited by: 217
Authors
Marcot, Bruce G. [1]
Hanea, Anca M. [2]
Affiliations
[1] US Forest Serv, Pacific Northwest Res Stn, Portland, OR 97208 USA
[2] Univ Melbourne, Ctr Excellence Biosecur Risk Anal, Parkville, Vic 3010, Australia
Keywords
Model validation; Classification error; Randomized subsets; Sample size; Model selection
DOI
10.1007/s00180-020-00999-9
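Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract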
Cross-validation using randomized subsets of data, known as k-fold cross-validation, is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created 6 variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n - 5, n - 2, and n - 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
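
The k-fold procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or their BN software. It assumes a discrete naive Bayes classifier as a stand-in for a Bayesian network, binary features, and synthetic data with 10% label noise. The function names (`k_fold_indices`, `fit_naive_bayes`, `predict`, `cv_error`) are hypothetical. Only the sample size n = 50 and the seven levels of k come from the abstract.

```python
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_naive_bayes(rows, labels):
    """Count class and per-feature value frequencies (assumed stand-in model)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, y)][v] += 1
    return class_counts, value_counts


def predict(model, row, n_values=2):
    """Return the class with the highest Laplace-smoothed posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, cy in sorted(class_counts.items()):
        score = cy / total
        for j, v in enumerate(row):
            score *= (value_counts[(j, y)][v] + 1) / (cy + n_values)
        if score > best_score:
            best, best_score = y, score
    return best


def cv_error(rows, labels, k, seed=0):
    """Pooled error: fraction of cases misclassified while held out."""
    n = len(rows)
    errors = 0
    for fold in k_fold_indices(n, k, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit_naive_bayes([rows[i] for i in train],
                                [labels[i] for i in train])
        errors += sum(predict(model, rows[i]) != labels[i] for i in fold)
    return errors / n


if __name__ == "__main__":
    # Synthetic data (an assumption): the label follows feature 0 with 10% noise.
    rng = random.Random(1)
    n = 50
    rows = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)]
    labels = [x0 if rng.random() < 0.9 else 1 - x0 for x0, _ in rows]
    for k in (2, 5, 10, 20, n - 5, n - 2, n - 1):
        print(f"k = {k:2d}  classification error = {cv_error(rows, labels, k):.3f}")
```

Each case is held out exactly once, so the error is pooled over all n predictions. Repeating the loop with several seeds would show how much the estimate varies between random partitions at a given k.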
Pages: 2009-2031
Number of pages: 23