Subsampling bias and the best-discrepancy systematic cross validation

Cited by: 9
Authors
Guo, Liang [1 ]
Liu, Jianya [2 ]
Lu, Ruodan [3 ,4 ]
Affiliations
[1] Shandong Univ, Data Sci Inst, Weihai 264209, Peoples R China
[2] Shandong Univ, Data Sci Inst, Jinan 250100, Shandong, Peoples R China
[3] Loughborough Univ, Sch Architecture Bldg & Civil Engn, Loughborough LE11 3TU, Leics, England
[4] Univ Cambridge, Darwin Coll, Cambridge CB3 9EU, England
Funding
National Natural Science Foundation of China;
Keywords
subsampling bias; cross validation; systematic sampling; low-discrepancy sequence; best-discrepancy sequence; SEQUENCES; OPTIMIZATION; VALUATION; ERROR;
DOI
10.1007/s11425-018-9561-0
Chinese Library Classification (CLC)
O29 [Applied Mathematics];
Discipline code
070104;
Abstract
Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors, and jeopardizes the reliability and effectiveness of cross-validation. Drawing on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and yields more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree, and naive Bayes) show that, in general, our cross-validation procedure reduces the subsampling bias in MCCV, lowering the EPE by around 7.18% and the variance by around 26.73%. In comparison, stratified MCCV reduces the EPE and variance of MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) cross-validation lowers the EPE by around 2.50%, but its variance is much higher than that of any other cross-validation (CV) procedure. The computational time of our procedure is just 8.64% of that of MCCV, 8.67% of that of stratified MCCV, and 16.72% of that of LOO. Experiments also show that our approach is most beneficial for datasets of relatively small size and large aspect ratio, i.e., with many features relative to the number of instances. This makes it particularly pertinent for bioscience classification problems. The proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
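To make the procedure concrete, here is a minimal Python sketch of one plausible reading of best-discrepancy systematic cross-validation, based only on the abstract: instances are ordered (sorting by target value is an assumed key), index i is mapped to the fractional part u_i = {i * theta} of the golden-section Weyl sequence with theta = (sqrt(5) - 1) / 2, whose star discrepancy decays at the optimal O(log N / N) rate, and [0, 1) is cut into k equal bins to obtain the folds. The function names, the ordering key, and the fold-assignment rule are illustrative assumptions, not taken from the paper.

    import math
    import numpy as np

    def best_discrepancy_folds(n_samples, k, theta=(math.sqrt(5) - 1) / 2):
        # Golden-section Weyl sequence u_i = frac(i * theta): a
        # low-discrepancy (equidistributed) sequence on [0, 1).
        u = np.modf(np.arange(1, n_samples + 1) * theta)[0]
        # Cut [0, 1) into k equal bins; the bin index is the fold
        # index in 0..k-1, so consecutive ordered instances are
        # spread evenly across folds.
        return np.floor(u * k).astype(int)

    def bds_cross_validation(X, y, k, fit, score):
        # Systematic k-fold CV: order the instances (ordering by the
        # target is an assumption here), then assign folds with the
        # best-discrepancy sequence instead of a pseudo-random shuffle.
        order = np.argsort(y, kind="stable")
        folds = best_discrepancy_folds(len(y), k)
        errs = []
        for f in range(k):
            test, train = order[folds == f], order[folds != f]
            model = fit(X[train], y[train])
            errs.append(score(model, X[test], y[test]))
        return float(np.mean(errs)), float(np.var(errs))

With scikit-learn, for example, fit could be lambda X, y: LogisticRegression(max_iter=1000).fit(X, y) and score could be lambda m, X, y: 1.0 - m.score(X, y), i.e., the misclassification rate; the mean and variance of the per-fold errors then play the roles of the EPE estimate and its spread discussed above.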
Pages: 197-210
Number of pages: 14