What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?

Cited by: 217

Authors:

Marcot, Bruce G. [1]

Hanea, Anca M. [2]

Affiliations:

[1] US Forest Serv, Pacific Northwest Res Stn, Portland, OR 97208 USA

[2] Univ Melbourne, Ctr Excellence Biosecur Risk Anal, Parkville, Vic 3010, Australia

Source:

COMPUTATIONAL STATISTICS | 2021 / Vol. 36 / Issue 03

Keywords:

Model validation; Classification error; Randomized subsets; Sample size; Model selection

DOI:

10.1007/s00180-020-00999-9

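Chinese Library Classification (CLC) number:

O21 [Probability theory and mathematical statistics]; C8 [Statistics]

Discipline classification codes:

020208; 070103; 0714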

Abstract:

Cross-validation using randomized subsets of data, known as k-fold cross-validation, is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created 6 variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n - 5, n - 2, and n - 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
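
For readers unfamiliar with the procedure, the following is an illustrative Python sketch of how classification error can be estimated at the fold levels named in the abstract. It is not the authors' code, data, or BN models. It assumes synthetic binary data and uses a naive Bayes classifier as a simple stand-in for a discrete BN. The function names (`make_data`, `fit_naive_bayes`, `predict`, `kfold_error`) and the 0.8 agreement probability are hypothetical choices made only for this example.

```python
import random
from collections import Counter, defaultdict


def make_data(n, seed=0):
    """Synthetic discrete data (assumption): a binary class y and three
    binary features that each agree with y with probability 0.8."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = tuple(y if rng.random() < 0.8 else 1 - y for _ in range(3))
        rows.append((x, y))
    return rows


def fit_naive_bayes(train):
    """Count classes and per-feature values; naive Bayes is used here
    only as a stand-in for a discrete Bayesian network."""
    class_counts = Counter(y for _, y in train)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for x, y in train:
        for j, v in enumerate(x):
            feat_counts[(j, y)][v] += 1
    return class_counts, feat_counts


def predict(model, x):
    """Return the class with the highest Laplace-smoothed posterior score."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y in (0, 1):
        c = class_counts.get(y, 0)
        score = (c + 1) / (total + 2)
        for j, v in enumerate(x):
            score *= (feat_counts[(j, y)][v] + 1) / (c + 2)
        if score > best_score:
            best, best_score = y, score
    return best


def kfold_error(rows, k, seed=0):
    """Pooled classification error over k randomized, disjoint folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal subsets
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [rows[i] for i in idx if i not in held_out]
        model = fit_naive_bayes(train)
        errors += sum(predict(model, rows[i][0]) != rows[i][1] for i in fold)
    return errors / len(rows)


n = 50
data = make_data(n)
# Fold levels taken from the abstract: k = 2, 5, 10, 20, n - 5, n - 2, n - 1
for k in (2, 5, 10, 20, n - 5, n - 2, n - 1):
    print(k, round(kfold_error(data, k), 3))
```

Each row is held out exactly once, so the returned value is the overall misclassification rate for that k. Repeating the loop for n = 500 and n = 5000 mirrors the sample sizes in the abstract, but results from this toy setup should not be read as reproducing the paper's findings.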

Pages: 2009-2031

Page count: 23

References (37 in total):

[1] Hybrid Bayesian network classifiers: Application to species distribution models
Aguilera, P. A.
Fernandez, A.
Reche, F.
Rumi, R.
[J]. ENVIRONMENTAL MODELLING & SOFTWARE, 2010, 25 (12) : 1630 - 1639
[2] A novel definition of the multivariate coefficient of variation
Albert, Adelin
Zhang, Lixin
[J]. BIOMETRICAL JOURNAL, 2010, 52 (05) : 667 - 675
[3] ANGUITA D, 2012, P ESANN 2012 EUR S A
[4] [Anonymous], 2001, THESIS MIT CAMBRIDGE
[5] A survey of cross-validation procedures for model selection
Arlot, Sylvain
Celisse, Alain
[J]. STATISTICS SURVEYS, 2010, 4 : 40 - 79
[6] Gyrfalcon nest distribution in Alaska based on a predictive GIS model
Booms, Travis L.
Huettmann, Falk
Schempf, Philip F.
[J]. POLAR BIOLOGY, 2010, 33 (03) : 347 - 358
[7] Calibrating vascular plant abundance for detecting future climate changes in Oregon and Washington, USA
Brady, Timothy J.
Monleon, Vicente J.
Gray, Andrew N.
[J]. ECOLOGICAL INDICATORS, 2010, 10 (03) : 657 - 667
[8] SUBMODEL SELECTION AND EVALUATION IN REGRESSION - THE X-RANDOM CASE
BREIMAN, L
SPECTOR, P
[J]. INTERNATIONAL STATISTICAL REVIEW, 1992, 60 (03) : 291 - 319
[9] Cawley GC, 2007, J MACH LEARN RES, V8, P841
[10] From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support
Constantinou, Anthony Costa
Fenton, Norman
Marsh, William
Radlinski, Lukasz
[J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2016, 67 : 75 - 93
