On the class overlap problem in imbalanced data classification

Cited by: 161
Authors
Vuttipittayamongkol, Pattaramon [1]
Elyan, Eyad [2]
Petrovski, Andrei [2]
Affiliations
[1] Mae Fah Luang Univ, Sch Informat Technol, Chiang Rai, Thailand
[2] Robert Gordon Univ, Sch Comp, Aberdeen, Scotland
Keywords
Imbalanced data; Class overlap; Classification; Evaluation metric; Benchmark; EXTREME LEARNING-MACHINE; SAMPLING METHOD; DATA-SETS; NEURAL-NETWORKS; SMOTE; PERFORMANCE; ENSEMBLE; BINARY; IDENTIFICATION; SENSITIVITY;
DOI
10.1016/j.knosys.2020.106631
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selective literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest development in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms' performance. (C) 2020 Elsevier B.V. All rights reserved.
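The abstract's central claim, that overlap degrades performance at every imbalance level while imbalance alone does not always matter, can be illustrated with a toy calculation. The sketch below is not the paper's experiment: it assumes two one-dimensional unit-variance Gaussian classes (majority at 0, minority at `separation`) and a Bayes-optimal threshold, and the names `normal_cdf` and `minority_sensitivity` are hypothetical helpers introduced only for this illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def minority_sensitivity(separation: float, minority_prior: float) -> float:
    """Recall on the minority class under the Bayes-optimal threshold.

    Toy setup (an assumption, not the paper's benchmark): the majority class
    is N(0, 1), the minority class is N(separation, 1), and a smaller
    separation means more class overlap.
    """
    majority_prior = 1.0 - minority_prior
    # Threshold where the prior-weighted class densities are equal.
    threshold = (separation / 2.0
                 + math.log(majority_prior / minority_prior) / separation)
    # Probability that a minority sample falls above the threshold.
    return normal_cdf(separation - threshold)


if __name__ == "__main__":
    for separation in (8.0, 2.0, 1.0):
        for minority_prior in (0.5, 0.1, 0.01):
            recall = minority_sensitivity(separation, minority_prior)
            print(f"separation={separation:>4} "
                  f"minority_prior={minority_prior:>5} "
                  f"minority recall={recall:.4f}")
```

Under these assumptions, minority recall stays close to 1 for well-separated classes even when the minority prior is small, and it falls as the separation shrinks, most sharply when the data are also imbalanced.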
Pages: 17