A new data complexity measure for multi-class imbalanced classification tasks

Cited by: 1
Authors
Han, Mingming [1]
Guo, Husheng [1,2]
Wang, Wenjian [1,2]
Affiliations
[1] Shanxi Univ, Sch Comp & Informat Technol, Taiyuan 030006, Shanxi, Peoples R China
[2] Shanxi Univ, Key Lab Computat Intelligence & Chinese Informat P, Minist Educ, Taiyuan 030006, Shanxi, Peoples R China
Keywords
Data characteristic; Skewed distribution; Correlation; Multi-class;
DOI
10.1016/j.patcog.2024.110881
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Skewed class distributions and data complexity can severely degrade imbalanced classification results. The cost of classification can be significantly reduced if these data complexities are measured and addressed by pre-processing prior to training, particularly when dealing with large-scale and high-dimensional datasets. Although many methods have been proposed to evaluate data complexity, most of them fail to fully consider the interaction among different data characteristics, or the connection between class imbalance and these characteristics, which makes it hard to evaluate classification difficulty effectively. This paper presents a new data complexity measure, MFII (multi-factor imbalance index), which measures the combined effects of the skewed class distribution and data characteristics on classification difficulty. In particular, it investigates the impact of overlap size, confusion degree, and sub-cluster structure. VoR (value of resolution) and DoC (degree of consistency) are also proposed to evaluate the resolution and interpretability of complexity measures. The experimental results demonstrate that MFII has a lower VoR and a stronger correlation with classification metrics, which indicates that MFII can more accurately evaluate the difficulty of multi-class imbalanced classification tasks.
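The abstract's claim of a "stronger correlation with classification metrics" can be illustrated with a minimal sketch of how such a correlation might be checked. This is not the paper's MFII, VoR, or DoC, whose definitions are not given in this record. It assumes a Spearman rank correlation as the correlation statistic, and the per-dataset complexity scores and macro-F1 values below are hypothetical numbers chosen only for illustration.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values: one complexity score and one macro-F1 per dataset.
complexity = [0.12, 0.35, 0.48, 0.61, 0.80]
macro_f1 = [0.93, 0.85, 0.79, 0.66, 0.52]

rho = spearman(complexity, macro_f1)
print(f"Spearman correlation: {rho:.3f}")
```

In this toy setting, a correlation close to -1 would mean that datasets scored as more complex are consistently the ones on which the classifier performs worse, which is the behaviour one would want from a complexity measure.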
Pages: 13