HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

Cited by: 34
Authors
Al Majzoub, Hisham [1]
Elgedawy, Islam [2]
Akaydin, Oyku [3]
Ulukok, Mehtap Kose [4]
Affiliations
[1] Cyprus Int Univ, Sch Appl Sci, Management Informat Syst Dept, Via Mersin 10, Nicosia, Turkey
[2] Middle East Tech Univ, Dept Comp Engn, Northern Cyprus Campus,Mersin 10, TR-99738 Kalkanli, Guzelyurt, Turkey
[3] Cyprus Int Univ, Dept Comp Engn, Via Mersin 10, Nicosia, Turkey
[4] Univ City Isl, Dept Software Engn, Via Mersin 10, Famagusta, Turkey
Keywords
Imbalanced data; Borderline SMOTE; Oversampling; SMOTE; AB-SMOTE; k-means clustering
DOI
10.1007/s13369-019-04336-1
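Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification code
07; 0710; 09
Abstract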
A binary dataset is considered imbalanced when one of its two classes (the minority class) holds less than 40% of the data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they misclassify instances of the minority class. Many techniques have been proposed to reduce this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which produces useless instances and wastes time and memory. Several SMOTE derivatives, such as Borderline SMOTE, were proposed to overcome this problem, yet the number of generated instances changed only slightly. To address this, the paper proposes a novel approach for generating synthetic data instances, Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE), which minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, to remove majority-class noise instances, with oversampling, to enhance the density of the borderline. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed the SMOTE, Borderline SMOTE, AB-SMOTE and CAB-SMOTE approaches that preceded it, providing the highest classification accuracy with the fewest generated instances.
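The two basic ideas the abstract builds on, the 40% imbalance criterion and SMOTE-style synthetic instance generation, can be illustrated with a minimal sketch. This is not the HCAB-SMOTE algorithm itself (it has no borderline detection, undersampling or k-means step). The function names, the default `k=3`, and the toy points are assumptions made only for illustration, and the interpolation shown is the commonly described SMOTE scheme, not code from the paper.

```python
import math
import random
from collections import Counter


def minority_share(labels):
    """Return (minority_label, fraction of instances carrying that label)."""
    counts = Counter(labels)
    label, count = min(counts.items(), key=lambda kv: kv[1])
    return label, count / len(labels)


def is_imbalanced(labels, threshold=0.40):
    """Criterion from the abstract: the minority class holds less than 40%."""
    return minority_share(labels)[1] < threshold


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points (hypothetical helper, plain SMOTE idea).

    Each point lies on the segment between a randomly chosen minority
    instance and one of its k nearest minority neighbours.
    Requires at least two minority instances.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed): 8 majority labels, 2 minority labels -> 20% minority share.
labels = [0] * 8 + [1] * 2
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority instances, it always falls inside the region the minority class already occupies. Plain SMOTE picks the base instance uniformly at random, which is the behaviour the abstract criticises as producing useless instances.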
Pages: 3205-3222
Number of pages: 18