Enhanced Cross-Validation Methods Leveraging Clustering Techniques

Times cited: 0
Authors
Yucelbas, Cuneyt [1 ]
Yucelbas, Sule [2 ]
Affiliations
[1] Tarsus Univ, Dept Elect & Automat, TR-33400 Mersin, Turkiye
[2] Tarsus Univ, Comp Engn Dept, TR-33400 Mersin, Turkiye
Keywords
large-scale classification; cross-validation methodology; k-means; k-medoids; clustering techniques; classifiers; selection
DOI
10.18280/ts.400626
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The efficacy of emerging and established learning algorithms warrants scrutiny, and this examination is intrinsically linked to classification performance results. The primary determinant influencing these results is how the training and test data presented to the algorithms are distributed. Existing literature frequently employs standard and stratified k-fold cross-validation (S-CV and St-CV) methods to create training and test data for classification tasks. In the S-CV method, training and test groups are formed via random data distribution, which can undermine the reliability of the performance results calculated after classification. This study introduces innovative cross-validation strategies based on k-means and k-medoids clustering to address this challenge. These strategies are designed to tackle issues arising from random data distribution. The proposed methods autonomously determine the number of clusters and folds: the number of clusters is first established via Silhouette analysis, and the number of folds is then identified according to the data volume within these clusters. An additional aim of this study is to minimize the standard deviation (Std) values between the folds. Particularly when classifying large datasets, the minimized Std removes the need to present every fold to the system, thereby reducing time expenditure and computational load. Analyses were carried out on several large-scale datasets to demonstrate the superiority of these new CV methods over the S-CV and St-CV techniques, and the findings revealed superior performance results for the novel strategies. For instance, while the minimum Std value between folds was 0.022, the maximum accuracy rate achieved was approximately 100%. Owing to the proposed methods, the discrepancy between the performance outputs of each fold and the overall average is statistically minimized, and the randomness in creating the training/test groups, previously identified as a factor contributing to this discrepancy, has been significantly reduced. Hence, this study is anticipated to fill a critical gap in the existing literature concerning the formation of training/test groups in various classification problems and the statistical accuracy of performance results.
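The abstract outlines a two-step procedure: choose the number of clusters by Silhouette analysis, then assemble folds so that each fold reflects the discovered cluster structure. The sketch below, in Python with scikit-learn, is a minimal illustration of that idea rather than the authors' exact algorithm; the candidate cluster range, the fold-count rule, and the function names (silhouette_best_k, clustered_cv_folds) are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_best_k(X, k_range=range(2, 11), random_state=0):
    """Pick the cluster count with the highest mean silhouette score."""
    best_k, best_score = 2, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def clustered_cv_folds(X, n_folds=None, random_state=0):
    """Build CV folds that draw proportionally from every cluster.

    The cluster count comes from silhouette analysis; if n_folds is not
    given, it is set (an assumption made in this sketch) to the size of the
    smallest cluster, capped at 10, so each fold can hold at least one
    point from every cluster.
    """
    rng = np.random.default_rng(random_state)
    k = silhouette_best_k(X, random_state=random_state)
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(X)

    if n_folds is None:
        n_folds = int(min(10, np.bincount(labels).min()))

    folds = [[] for _ in range(n_folds)]
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Deal each cluster's members round-robin across the folds so every
        # fold mirrors the cluster structure of the full dataset.
        for i, sample in enumerate(idx):
            folds[i % n_folds].append(sample)
    return [np.asarray(f) for f in folds]
```

Each returned index array can then serve as the test split while the remaining folds form the training data. Swapping KMeans for a k-medoids implementation (for example, KMedoids from the scikit-learn-extra package) would give the k-medoids variant mentioned in the abstract.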
Pages: 2649-2660
Page count: 12