Neighborhood repartition-based oversampling algorithm for multiclass imbalanced data with label noise

Cited by: 2
Authors
Shen, Shiyi [1]
Li, Zhixin [1]
Huan, Zhan [1]
Shang, Fanqi [1]
Wang, Yongsong [2]
Chen, Ying [1]
Affiliations
[1] Changzhou Univ, Sch Microelect & Control Engn, Changzhou, Jiangsu, Peoples R China
[2] Changzhou Univ, Sch Comp & Artificial Intelligence, Changzhou, Jiangsu, Peoples R China
Keywords
Multiclass imbalance; Oversampling; Neighborhood repartition; Label noise; Machine learning; SMOTE; Classification
DOI
10.1016/j.neucom.2024.128090
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The classification of imbalanced data remains one of the most significant topics in contemporary data analysis. Existing classification algorithms tend to favor majority classes, which leads to false predictions and makes class overlap and label noise difficult to handle. These challenges are particularly evident in multiclass settings, where the mutual imbalance relationships among classes become more complex. Despite this, the vast majority of research in this field has concentrated on binary problems, while the more difficult multiclass problems remain relatively underexplored. In this paper, we propose a novel data-sampling technique, the Multiclass Neighborhood Repartition-based Oversampling (MC-NRO) algorithm. The innovation of this method is that it considers the local data characteristics of each class to constrain the oversampled neighborhood. MC-NRO calculates the mutual potential of different classes to precisely optimize the subregions in which new instances are generated. By selecting different repartition neighborhoods to meet the needs of a specific domain, it can detect outliers and label noise, expand the decision boundary of minority classes, and avoid class overlap through data cleaning. The experimental results demonstrate that MC-NRO outperforms other advanced oversampling strategies, ranking first on average across the three evaluation metrics, and that it is robust, especially on datasets with high noise levels. More importantly, MC-NRO is highly versatile: it can be flexibly applied to various classifiers and is particularly suitable for processing naturally complex (i.e., not affected by noise) datasets.
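The abstract does not give the details of MC-NRO, so the following is only a rough illustration of the general idea it describes: clean suspected label noise using local neighborhoods, then oversample each minority class inside its own neighborhood. It is a minimal sketch, not the authors' algorithm: it uses plain SMOTE-style interpolation and a simple k-nearest-neighbor filter, and it does not compute the mutual class potential or the repartitioned neighborhoods that MC-NRO relies on. The function names `knn_indices` and `oversample_with_noise_filter`, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row, excluding the row itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def oversample_with_noise_filter(X, y, k=3, seed=0):
    """Hypothetical helper: kNN noise filter followed by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)

    # Step 1 (assumed cleaning rule): treat a point as suspected label noise
    # when none of its k nearest neighbours shares its label, and drop it.
    nn = knn_indices(X, k)
    keep = (y[nn] == y[:, None]).any(axis=1)
    Xc, yc = X[keep], y[keep]

    # Step 2: bring every class up to the size of the largest class by
    # interpolating between a sample and one of its same-class neighbours.
    classes, counts = np.unique(yc, return_counts=True)
    target = counts.max()
    new_X, new_y = [Xc], [yc]
    for c, n in zip(classes, counts):
        if n == target or n < 2:
            continue  # already the largest class, or too few points to interpolate
        Xm = Xc[yc == c]
        kk = min(k, n - 1)
        nn_c = knn_indices(Xm, kk)
        base = rng.integers(0, n, size=target - n)
        neigh = nn_c[base, rng.integers(0, kk, size=target - n)]
        gap = rng.random((target - n, 1))
        new_X.append(Xm[base] + gap * (Xm[neigh] - Xm[base]))
        new_y.append(np.full(target - n, c))
    return np.vstack(new_X), np.concatenate(new_y)


# Toy three-class data with one deliberately mislabeled point at the origin.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal([0, 0], 0.3, size=(30, 2)),  # majority class 0
    data_rng.normal([5, 5], 0.3, size=(10, 2)),  # minority class 1
    data_rng.normal([0, 5], 0.3, size=(5, 2)),   # minority class 2
    [[0.0, 0.0]],                                # labeled 2, sits inside class 0
])
y = np.array([0] * 30 + [1] * 10 + [2] * 5 + [2])

X_res, y_res = oversample_with_noise_filter(X, y)
print(np.bincount(y_res))
```

In this toy setup the mislabeled point is removed before interpolation, so no synthetic class-2 samples are generated inside the class-0 region. A global kNN filter like this one is much cruder than a per-class neighborhood repartition, and it can discard genuine minority samples that lie in sparse regions.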
Pages: 17