Local neighborhood encodings for imbalanced data classification

Cited by: 0
Authors
Koziarski, Michal [1 ]
Wozniak, Michal [1 ]
Affiliations
[1] Wroclaw Univ Sci & Technol, Dept Syst & Comp Networks, Wybrzeze Wyspianskiego 27, PL-50370 Wroclaw, Poland
Keywords
Machine learning; Imbalanced data; Oversampling; Undersampling; Evolutionary algorithm; SMOTE; CHALLENGES
DOI
10.1007/s10994-024-06563-6
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes Local Neighborhood Encodings (LNE), a hybrid data preprocessing method dedicated to balancing skewed class distributions. The LNE algorithm combines over- and undersampling, and the intensity of each is chosen separately for every fraction of minority and majority class objects. The intensity is selected depending on the neighborhood type of the objects of a given class, understood as the number of nearest neighbors that share the class of a given object. The selection of the over- and undersampling intensities is treated as an optimization problem solved with an evolutionary algorithm. The quality of the proposed method was evaluated in computer experiments, in which LNE compared very favorably with state-of-the-art resampling strategies. In addition, an experimental analysis of the algorithm's behavior was performed, i.e., how the selected preprocessing parameters depend on the characteristics of the decision problem and on the type of classifier used, and an ablation study evaluated the influence of the individual components on the quality of the obtained classifiers. The paper also examines how classification quality is affected by the way the objective function is evaluated inside the evolutionary algorithm. In the considered task the objective function is not deterministic and its value must be estimated; hence, from the point of view of computational efficiency, it was important to investigate the use of a so-called proxy classifier for quality assessment, i.e., a classifier of low computational complexity, even though the final model is learned with a different classifier. The proposed preprocessing method achieves high quality compared to the state of the art, although it requires significantly more computational effort. Nevertheless, it can be successfully applied whenever no very restrictive model-building time constraints are imposed.
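To make the described procedure more concrete, the Python sketch below illustrates the general idea under several simplifying assumptions: duplication-based oversampling and probabilistic sample removal per neighborhood type (the paper's actual resampling operators may differ), Gaussian naive Bayes as the proxy classifier, and a toy (1+lambda) mutation loop standing in for the evolutionary algorithm used by the authors. All function names, parameters, and the toy dataset are illustrative and not taken from the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def neighborhood_types(X, y, k=5):
    """Neighborhood type of each sample: how many of its k nearest
    neighbours belong to the same class (an integer in 0..k)."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)
    # Column 0 is the sample itself, so it is skipped.
    return (y[idx[:, 1:]] == y[:, None]).sum(axis=1)


def resample_by_type(X, y, minority_label, over_rates, under_rates, k=5, seed=0):
    """Type-dependent resampling (illustrative): a minority sample of type t
    is duplicated round(over_rates[t]) extra times, a majority sample of
    type t is kept with probability under_rates[t]."""
    rng = np.random.default_rng(seed)
    types = neighborhood_types(X, y, k)
    minority = y == minority_label
    keep = minority | (rng.random(len(y)) < under_rates[types])            # undersampling
    reps = np.where(minority, np.rint(over_rates[types]).astype(int), 0)   # oversampling
    return (np.vstack([X[keep], np.repeat(X, reps, axis=0)]),
            np.concatenate([y[keep], np.repeat(y, reps)]))


def optimize_rates(X, y, minority_label, k=5, generations=20, children=10, seed=0):
    """Tiny (1+lambda)-style mutation loop over the 2*(k+1) intensity vector,
    scored with a cheap proxy classifier (Gaussian naive Bayes here)."""
    rng = np.random.default_rng(seed)
    parent = np.concatenate([rng.uniform(0.0, 2.0, k + 1),   # oversampling intensities
                             rng.uniform(0.2, 1.0, k + 1)])  # undersampling keep-probabilities
    best_score = -np.inf
    for _ in range(generations):
        for _ in range(children):
            child = parent + rng.normal(0.0, 0.2, parent.size)
            child[:k + 1] = np.clip(child[:k + 1], 0.0, None)
            child[k + 1:] = np.clip(child[k + 1:], 0.2, 1.0)
            Xr, yr = resample_by_type(X, y, minority_label,
                                      child[:k + 1], child[k + 1:], k, seed)
            score = cross_val_score(GaussianNB(), Xr, yr, cv=3,
                                    scoring="balanced_accuracy").mean()
            if score > best_score:
                parent, best_score = child, score
    return parent


# Assumed usage on a synthetic imbalanced dataset:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
rates = optimize_rates(X, y, minority_label=1)
X_bal, y_bal = resample_by_type(X, y, 1, rates[:6], rates[6:])

The sketch reflects the structure of the approach (per-neighborhood-type intensities, a search over those intensities, and a cheap proxy classifier inside the objective), not the authors' implementation or the evaluation protocol reported in the paper.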
Pages: 7421-7449
Number of pages: 29