Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE)

Cited by: 27
Authors
El Moutaouakil, Karim [1]
Roudani, Mouhamed [1]
El Ouissari, Abdellatif [1]
Affiliations
[1] Univ Sidi Mohamed Ben Abdellah, Multidisciplinary Fac Taza, Dept Math, Engn Sci Lab ESL, Fes, Morocco
Keywords
Classification; Clustering; Entropy; Oversampling; GA; GMM; SMOTE; Unbalanced data; Big data; OVER-SAMPLING TECHNIQUE; CLASSIFICATION; CLASSIFIERS; PREDICTION; FRAMEWORK; SPARK;
DOI
10.1016/j.knosys.2022.110235
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Classification problems on unbalanced data sets are commonplace in industrial production and medical research. Various approaches handle these problems by generating synthetic samples, but most of them rely on hyperparameters and tend to generate noise because they neglect the entropy of the initial data. Oversampling methods based on clustering have recently been proposed to overcome this problem. Unfortunately, they inherit the sensitivity of hard clustering methods, and their hyperparameters are selected manually. This paper introduces Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE), which balances data with minimum noise based on an original mathematical model, soft clustering, and evolutionary optimization. First, to handle the sensitivity of K-means, OEGFCM-SMOTE uses a SMOTE variant that generates samples in safe regions identified by Fuzzy-C-Means, which is known to deal consistently with the boundary problem. Fuzzy-C-Means SMOTE proceeds in three steps (grouping, filtering, and interpolation) and has four parameters: the number of clusters, the number of neighboring points of the minority data, the threshold of the unbalanced ratio, and the exponent of the distribution of the minority data in the promising clusters. Second, these parameters are chosen by a mixed-variable optimization model that minimizes the amount of noise, measured by the entropy; the feasible domain is estimated from the density of the data set and from a study of the boundary cases. Finally, this model is solved with a genetic algorithm using genetic operators with appropriate rates. OEGFCM-SMOTE is evaluated with 5 classifiers on 21 unbalanced data sets (15 of ordinary size and 6 Big data) and is compared with 14 oversampling methods using three performance measures. Holm's test is used to address the problem of multiple comparisons across data sets. OEGFCM-SMOTE consistently outperforms other popular oversampling methods. (c) 2022 Elsevier B.V. All rights reserved.
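The three steps named in the abstract (grouping, filtering, interpolation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names `fuzzy_c_means` and `fcm_smote_sketch`, the definition of the imbalance ratio as majority count over minority count, the parameter names `k` and `irt`, and the equal share of synthetic points per retained cluster (used here in place of the paper's distribution exponent) are all assumptions made for illustration. The entropy-based objective and the genetic search over the parameters are not shown.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row summing to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Point-to-center distances; the epsilon avoids division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

def fcm_smote_sketch(X, y, minority=1, n_clusters=2, k=3, irt=1.0, seed=0):
    """Hypothetical sketch: returns synthetic minority points only."""
    rng = np.random.default_rng(seed)
    # Step 1 (grouping): soft-cluster all points, then take the strongest cluster.
    _, U = fuzzy_c_means(X, n_clusters, seed=seed)
    labels = U.argmax(axis=1)
    n_needed = int((y != minority).sum() - (y == minority).sum())
    # Step 2 (filtering): keep clusters whose majority/minority ratio is
    # at most the threshold (assumed definition of the unbalanced ratio).
    safe = []
    for c in range(n_clusters):
        in_c = labels == c
        n_min = int((in_c & (y == minority)).sum())
        n_maj = int((in_c & (y != minority)).sum())
        if n_min >= 2 and n_maj / n_min <= irt:
            safe.append(X[in_c & (y == minority)])
    if not safe or n_needed <= 0:
        return np.empty((0, X.shape[1]))
    # Step 3 (interpolation): SMOTE inside each retained cluster, cycling
    # through clusters so each receives an equal share (a simplification).
    synthetic = []
    for i in range(n_needed):
        pts = safe[i % len(safe)]
        a = pts[rng.integers(len(pts))]
        order = np.argsort(np.linalg.norm(pts - a, axis=1))
        neighbours = order[1:k + 1]  # skip the nearest entry, the point itself
        b = pts[rng.choice(neighbours)]
        synthetic.append(a + rng.random() * (b - a))
    return np.array(synthetic)
```

Calling `fcm_smote_sketch(X, y)` on a feature matrix `X` and label vector `y` returns only the new points, which would then be appended to the minority class; because each point is a convex combination of two minority points from the same retained cluster, none falls outside that cluster's minority region.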
Pages: 27