A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning

Cited by: 38
Authors
Szeghalmy, Szilvia [1]
Fazekas, Attila [1]
Affiliations
[1] Univ Debrecen, Fac Informat, H-4028 Debrecen, Hungary
Keywords
imbalanced learning; cross-validation; SCV; DOB-SCV; SMOTE; classification; recognition
DOI
10.3390/s23042333
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Nowadays, the solution to many practical problems relies on machine learning tools. However, compiling an appropriate training data set for real-world classification problems is challenging, because collecting the right amount of data for each class is often difficult or even impossible. In such cases, we can easily face the problem of imbalanced learning. There are many methods in the literature for handling imbalanced learning, so how to compare their performance fairly has become an important question. Inadequate validation techniques can produce misleading results (e.g., due to data shift), which has led to the development of validation methods designed for imbalanced data sets, such as stratified cross-validation (SCV) and distribution optimally balanced SCV (DOB-SCV). Previous studies have shown that higher classification performance scores (AUC) can be achieved on imbalanced data sets using DOB-SCV instead of SCV. We investigated the effect of oversamplers on this difference. The study was conducted on 420 data sets, involving several sampling methods and the DTree, kNN, SVM, and MLP classifiers. We point out that DOB-SCV often yields slightly higher F1 and AUC values when classification is combined with sampling. However, the results also show that the choice of the sampler-classifier pair matters more for classification performance than the choice between DOB-SCV and SCV.
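The evaluation protocol summarized above (class-aware fold assignment, oversampling applied only to the training part of each fold, F1 and AUC averaged over the folds) can be sketched as follows. This is a minimal illustrative sketch, assuming scikit-learn and imbalanced-learn are available; dob_scv_folds is a simplified re-implementation of the DOB-SCV fold-assignment idea (spreading each sample and its same-class nearest neighbours across different folds) written for this note, not the authors' code, and the data set is synthetic.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def dob_scv_folds(X, y, k=5, seed=None):
    # Simplified DOB-SCV fold assignment: within each class, a random
    # seed sample and its k-1 nearest still-unassigned neighbours are
    # placed into different folds, so every fold's within-class
    # distribution stays close to that of the whole data set.
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1, dtype=int)
    for cls in np.unique(y):
        unassigned = list(rng.permutation(np.where(y == cls)[0]))
        while unassigned:
            group = [unassigned.pop(0)]                  # random seed sample
            if unassigned:
                n_nb = min(k - 1, len(unassigned))
                nn = NearestNeighbors(n_neighbors=n_nb).fit(X[unassigned])
                _, nb = nn.kneighbors(X[group[0]].reshape(1, -1))
                group += [unassigned[j] for j in nb[0]]
                unassigned = [i for i in unassigned if i not in group]
            for fold_id, sample in enumerate(group):     # spread group over folds
                folds[sample] = fold_id
    return folds

# Small synthetic imbalanced problem (roughly 9:1), only for illustration.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
folds = dob_scv_folds(X, y, k=5, seed=0)

f1s, aucs = [], []
for fold_id in range(5):
    train, test = folds != fold_id, folds == fold_id
    # Oversample the training part only; the test fold is left untouched.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train], y[train])
    clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    f1s.append(f1_score(y[test], clf.predict(X[test])))
    aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))

print(f"mean F1  = {np.mean(f1s):.3f}")
print(f"mean AUC = {np.mean(aucs):.3f}")

Swapping dob_scv_folds for sklearn.model_selection.StratifiedKFold yields the plain SCV baseline, which is the comparison the paper carries out across many sampler-classifier pairs.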
Pages: 27
Related Papers
50 records in total; the first 10 are listed below.
  • [1] Distribution-balanced stratified cross-validation for accuracy estimation
    Zeng, XC
    Martinez, TR
    JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE, 2000, 12 (01) : 1 - 12
  • [2] Stratified Cross-Validation on Multiple Columns
    Motl, Jan
    Kordik, Pavel
    2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2021), 2021 : 26 - 31
  • [3] Cross-validation Strategies for Balanced and Imbalanced Datasets
    Fontanari, Thomas
    Froes, Tiago Comassetto
    Recamonde-Mendoza, Mariana
    INTELLIGENT SYSTEMS, PT I, 2022, 13653 : 626 - 640
  • [4] Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches
    Santos, Miriam Seoane
    Soares, Jastin Pompeu
    Abreu, Pedro Henriques
    Araujo, Helder
    Santos, Joao
    IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, 2018, 13 (04) : 59 - 76
  • [5] Cross-validation and permutations in MVPA: Validity of permutation strategies and power of cross-validation schemes
    Valente, Giancarlo
    Castellanos, Agustin Lage
    Hausfeld, Lars
    De Martino, Federico
    Formisano, Elia
    NEUROIMAGE, 2021, 238
  • [6] A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification
    Dai, Qizhu
    Li, Donggen
    Xia, Shuyin
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2023, 14 (08) : 2877 - 2886
  • [7] Cross-validation on extreme regions
    Aghbalou, Anass
    Bertail, Patrice
    Portier, Francois
    Sabourin, Anne
    EXTREMES, 2024, 27 (04) : 505 - 555
  • [8] The uncertainty principle of cross-validation
    Last, Mark
    2006 IEEE International Conference on Granular Computing, 2006 : 275 - 280
  • [9] Cross-Validation With Confidence
    Lei, Jing
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2020, 115 (532) : 1978 - 1997
  • [10] Targeted cross-validation
    Zhang, Jiawei
    Ding, Jie
    Yang, Yuhong
    BERNOULLI, 2023, 29 (01) : 377 - 402