Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE)

Cited by: 27
Authors
El Moutaouakil, Karim [1]
Roudani, Mouhamed [1]
El Ouissari, Abdellatif [1]
Affiliations
[1] Univ Sidi Mohamed Ben Abdellah, Multidisciplinary Fac Taza, Dept Math, Engn Sci Lab ESL, Fes, Morocco
Keywords
Classification; Clustering; Entropy; Oversampling; GA; GMM; SMOTE; Unbalanced data; Big data; OVER-SAMPLING TECHNIQUE; CLASSIFICATION; CLASSIFIERS; PREDICTION; FRAMEWORK; SPARK;
DOI
10.1016/j.knosys.2022.110235
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Classification problems involving unbalanced data sets are commonplace in industrial production and medical research. Various approaches have been proposed to handle these problems by generating synthetic samples, but most of them rely on hyperparameters and tend to generate noise because they neglect the entropy of the initial data. Recently, clustering-based oversampling methods have been proposed to overcome this problem. Unfortunately, they inherit the sensitivity of hard clustering methods, and their hyperparameters are selected manually. This paper introduces Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE), which balances data with minimum noise based on an original mathematical model, soft clustering, and evolutionary optimization. First, to handle the sensitivity of K-means, OEGFCM-SMOTE uses a SMOTE variant that generates samples in safe regions identified by Fuzzy-C-Means, which is known to be well suited to the boundary problem. Fuzzy-C-Means SMOTE proceeds in three steps (grouping, filtering, and interpolation) and has four parameters: the number of clusters, the number of neighboring points of the minority data, the threshold of the unbalanced ratio, and the exponent of the distribution of the minority data in the promising clusters. Second, these parameters are chosen through a mixed-variable optimization model that minimizes the amount of noise as measured by entropy; the feasible domain is estimated by considering the density of the data set and by studying the boundary cases. Finally, this model is solved with a genetic algorithm that adopts genetic operators with appropriate rates. OEGFCM-SMOTE is evaluated using 5 classifiers and 21 unbalanced data sets (15 of ordinary size and 6 Big data), and it is compared with 14 oversampling methods using three performance measures. To address the multiple-comparisons problem across the different data sets, Holm's test is used. OEGFCM-SMOTE consistently outperforms other popular oversampling methods. (c) 2022 Elsevier B.V. All rights reserved.
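The three steps named in the abstract (grouping, filtering, interpolation) and its four parameters can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact filtering rule, the cluster weighting, or the entropy objective, so the sketch borrows the filter-and-weight scheme commonly used in k-means SMOTE as a stand-in and assigns each point to its highest-membership fuzzy cluster. The names `fuzzy_c_means`, `fcm_smote`, `irt`, and `de` are hypothetical, and the genetic search over the parameters is omitted.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships of shape (n, c))."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # each row sums to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

def fcm_smote(X, y, minority=1, n_clusters=2, k_neighbors=3,
              irt=1.0, de=2.0, seed=0):
    """Assumed stand-in for Fuzzy-C-Means SMOTE (binary labels only)."""
    rng = np.random.default_rng(seed)
    n_needed = int((y != minority).sum() - (y == minority).sum())
    if n_needed <= 0:
        return X, y
    # 1) Grouping: soft clustering, then hard assignment by highest membership.
    _, U = fuzzy_c_means(X, n_clusters, seed=seed)
    labels = U.argmax(axis=1)
    # 2) Filtering: keep clusters whose majority/minority ratio is below irt
    #    (assumed rule), weighting sparse minority clusters more heavily.
    clusters, weights = [], []
    for j in range(n_clusters):
        pts = X[(labels == j) & (y == minority)]
        n_maj = int(((labels == j) & (y != minority)).sum())
        if len(pts) < 2 or (n_maj + 1) / (len(pts) + 1) >= irt:
            continue
        dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        mean_d = dist.sum() / (len(pts) * (len(pts) - 1))
        density = len(pts) / (mean_d ** de + 1e-12)
        clusters.append(pts)
        weights.append(1.0 / density)
    if not clusters:
        return X, y
    weights = np.array(weights) / np.sum(weights)
    # 3) Interpolation: SMOTE between a minority point and one of its
    #    k nearest minority neighbours inside the same cluster.
    synth = []
    for pts, n_new in zip(clusters, rng.multinomial(n_needed, weights)):
        dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        nn = np.argsort(dist, axis=1)[:, :min(k_neighbors, len(pts) - 1)]
        for _ in range(n_new):
            i = rng.integers(len(pts))
            j = nn[i, rng.integers(nn.shape[1])]
            synth.append(pts[i] + rng.random() * (pts[j] - pts[i]))
    X_new = np.vstack([X, np.array(synth)])
    y_new = np.concatenate([y, np.full(len(synth), minority)])
    return X_new, y_new
```

In this sketch the four arguments `n_clusters`, `k_neighbors`, `irt`, and `de` play the role of the four parameters listed in the abstract; in the paper they are selected by a genetic algorithm that minimizes an entropy-based noise measure rather than set by hand.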
Pages: 27