A unifying view of class overlap and imbalance: Key concepts, multi-view panorama, and open avenues for research

Cited by: 47
Authors
Santos, Miriam Seoane [1]
Abreu, Pedro Henriques [1]
Japkowicz, Nathalie [2]
Fernandez, Alberto [3]
Santos, Joao [4, 5]
Affiliations
[1] Univ Coimbra, Dept Informat Engn, CISUC, P-3030290 Coimbra, Portugal
[2] Amer Univ, Dept Comp Sci, Washington, DC 20016 USA
[3] Univ Granada, Andalusian Res Inst Data Sci & Computat Intellige, Dept Comp Sci & Artificial Intelligence, DaSCI, Granada, Spain
[4] Univ Porto, Inst Ciencias Biomed Abel Salazar, Porto, Portugal
[5] IPO Porto Res Ctr CI IPOP, Porto, Portugal
Keywords
Class imbalance; Imbalanced data; Class overlap; Data complexity; Data intrinsic characteristics; Complexity measures; FEATURE-SELECTION; DATA COMPLEXITY; COVID-19; CLASSIFICATION; SOFTWARE TOOL; ALGORITHMS; SMOTE; CLASSIFIERS; MACHINE; FUSION; KEEL
DOI
10.1016/j.inffus.2022.08.017
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The combination of class imbalance and overlap is currently one of the most challenging issues in machine learning. While seminal work focused on establishing class overlap as a complicating factor for classification tasks in imbalanced domains, ongoing research mostly concerns the study of their synergy in real-world applications. However, given the lack of a well-formulated definition and measurement of class overlap in real-world domains, especially in the presence of class imbalance, the research community has not yet reached a consensus on the characterisation of both problems. This naturally complicates the evaluation of existing approaches that address these issues simultaneously and prevents future research from moving towards the design of specialised solutions. In this work, we advocate for a unified view of the problem of class overlap in imbalanced domains. Acknowledging class overlap as the overarching problem, since it has proven to be more harmful for classification tasks than class imbalance, we start by discussing the key concepts associated with its definition, identification, and measurement in real-world domains, while advocating for a characterisation of the problem that attends to multiple sources of complexity. We then provide an overview of existing data complexity measures and link each measure to the specific types of class overlap problem it covers, proposing a novel taxonomy of class overlap complexity measures. Additionally, we characterise the relationship between measures and the insights they provide, and discuss to what extent they account for class imbalance. Finally, we systematise the current body of knowledge on the topic across several branches of machine learning (Data Analysis, Data Preprocessing, Algorithm Design, and Meta-learning), identifying existing limitations and discussing possible lines for future research.
Pages: 228-253
Number of pages: 26