Experimental Comparison of Classification Methods under Class Imbalance

Cited by: 2
Authors
Chen, Hui [1]
Ji, Mengru [2]
Affiliations
[1] Beijing Foreign Studies Univ, Beijing, Peoples R China
[2] Univ Gottingen, Gottingen, Germany
Keywords
Imbalance Classification; Resampling; Cost-Sensitive Learning; Distance Metric Learning; Ensemble Learning; Performance Evaluation; ENSEMBLE METHODS; PERFORMANCE; ALGORITHMS; SMOTE;
DOI
10.4108/eai.11-6-2021.170234
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
The class imbalance problem is prevalent in many domains, including medicine, natural language processing, image recognition, economics, and geography. We perform a systematic experimental comparison of imbalance classification algorithms - spanning sampling, distance metric learning, cost-sensitive learning, and ensemble learning approaches - on several datasets from UCI, KEEL, and OpenML. The algorithms compared are DDAE, MWMOTE, SMOTE, RUSBoost, AdaBoost, cost-sensitive decision tree (csDCT), self-paced Ensemble Classifier, MetaCost, CAdaMEC, and Iterative Metric Learning (IML). Because the substantial bias that class imbalance can induce is harmful to underrepresented classes of critical social and economic value - and can even cost lives - the main objective of our study is to understand the impact of the imbalance ratio and the dataset size on the performance of these algorithms. Our experiments show that: 1) Sampling methods perform the worst and cannot be used directly for imbalanced classification, since they do not take distance-based neighborhoods into account; however, some classifiers do improve once the class distribution has been balanced. 2) Cost-sensitive learning models should be used when the dataset is only mildly imbalanced, because it is difficult to set an appropriate cost matrix for a specific dataset, which can cause performance fluctuations. 3) IML consistently performs well (in terms of F1 and AUCPRC) and is resilient to different imbalance ratios, but it is sensitive to the data distribution of the dataset. 4) Ensemble learning techniques generally outperform the other approaches because they combine the strengths of multiple base classifiers.
5) In terms of system performance, the self-paced Ensemble Classifier fares well with regard to learning time, whereas IML and DDAE have the longest learning times; AdaBoost and the self-paced Ensemble Classifier require the least memory. Based on this analysis, we also provide empirical recommendations for algorithm selection under different requirements and usage scenarios.
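To make the sampling family concrete: SMOTE-style oversampling creates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below is a simplified, numpy-only illustration of that idea (the function name `smote_like_oversample` and the toy data are ours, not from the paper, and this is not the reference SMOTE implementation):

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by interpolating between each
    point and one of its k nearest minority-class neighbors (a simplified
    sketch of the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # pick a random minority point
        j = neighbors[i, rng.integers(k)]  # and one of its k neighbors
        lam = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy minority class: five points in 2-D.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_oversample(X_min, n_new=10, k=2, rng=0)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region rather than being arbitrary noise; finding #1 above notes that this neighborhood reasoning is exactly what naive resampling lacks.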
Pages: 20
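Finding #2 hinges on the cost matrix: cost-sensitive methods such as MetaCost relabel or predict by minimizing expected cost under that matrix. The snippet below is a minimal sketch of that decision rule, assuming a binary task with a hypothetical cost matrix in which missing the rare positive class is ten times costlier than a false alarm:

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i.
# The 10x penalty for missing the positive class is an illustrative choice.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def min_expected_cost(proba, cost):
    """For each sample, pick the class with the lowest expected cost given
    posterior probabilities -- the decision rule behind MetaCost-style
    cost-sensitive learning."""
    # expected[n, j] = sum_i P(class i | x_n) * cost[i, j]
    expected = proba @ cost
    return expected.argmin(axis=1)

# Posterior P(class | x) for three samples, e.g. from any probabilistic classifier.
proba = np.array([[0.95, 0.05],   # confident negative
                  [0.85, 0.15],   # mostly negative, but positives are costly
                  [0.30, 0.70]])  # likely positive
print(min_expected_cost(proba, cost))  # [0 1 1]
```

Note how the second sample is predicted positive despite an 85% negative posterior, purely because of the asymmetric costs; this sensitivity to the chosen matrix is why the paper recommends cost-sensitive models mainly for mildly imbalanced data.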