Grouped SMOTE With Noise Filtering Mechanism for Classifying Imbalanced Data

Cited by: 52
Authors
Cheng, Ke [1]
Zhang, Chen [1]
Yu, Hualong [1,2]
Yang, Xibei [1]
Zou, Haitao [1,2]
Gao, Shang [1,2]
Affiliations
[1] Jiangsu Univ Sci & Technol, Sch Comp, Zhenjiang 212003, Jiangsu, Peoples R China
[2] Sichuan Univ Sci & Engn, Artificial Intelligence Key Lab Sichuan Prov, Yibin 644000, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Noise measurement; Classification algorithms; Filtering algorithms; Safety; Filtering; Estimation; Data models; Sampling; class imbalance learning; SMOTE; Gaussian-Mixture model; probability density; FRAUD DETECTION; SAMPLING METHOD; CLASSIFICATION; PREDICTION;
DOI
10.1109/ACCESS.2019.2955086
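Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract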
SMOTE (Synthetic Minority Oversampling TEchnique) is one of the most popular and well-known sampling algorithms for addressing the class imbalance learning problem. Its main merit is that, compared with random oversampling, it alleviates overfitting to a large extent. However, two drawbacks of SMOTE have also been observed: 1) it tends to propagate noisy information during oversampling; 2) it always assigns a global neighborhood parameter $K$ and neglects local distribution characteristics. To address both problems simultaneously, this article presents a grouped SMOTE algorithm with a noise filtering mechanism (GSMOTE-NFM). The algorithm first adopts a Gaussian-Mixture Model (GMM) to estimate the real distributions of the majority and minority classes, respectively. Then, most noisy instances are removed by comparing the probability densities of the same instance under the two classes. Next, two new GMMs are constructed on the remaining majority and minority class instances, respectively. Based on the resulting probability density information, all minority class instances are then divided into three groups: safety, boundary and outlier. Finally, an individual parameter $K$ is assigned to the instances of each group to generate new instances. We tested the GSMOTE-NFM algorithm on 24 benchmark binary-class data sets with three popular classification models, and compared it with several state-of-the-art oversampling algorithms. The results indicate that our algorithm is significantly superior to the original SMOTE algorithm and to several SMOTE-based modified methods.
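The pipeline described in the abstract (density-based noise filtering, refitting, grouping, and a per-group $K$) can be illustrated with a minimal sketch. This is not the authors' implementation. To stay self-contained it makes several assumptions: the data are one-dimensional, a single Gaussian per class (Python's `statistics.NormalDist`) stands in for the Gaussian-Mixture Model, and the grouping thresholds `ratio` and `outlier_frac` as well as the `K_BY_GROUP` values are hypothetical choices, since the abstract does not state the actual rules.

```python
import random
from statistics import NormalDist, mean, stdev

# Hypothetical per-group neighbourhood sizes; the abstract gives no values.
K_BY_GROUP = {"safety": 5, "boundary": 3, "outlier": 1}

def fit_density(values):
    """Fit one 1-D Gaussian per class (a stand-in for the paper's GMM)."""
    return NormalDist(mean(values), stdev(values))

def filter_noise(minority, majority):
    """Keep an instance only if its own-class density is at least
    as high as its density under the other class."""
    p_min, p_maj = fit_density(minority), fit_density(majority)
    clean_min = [x for x in minority if p_min.pdf(x) >= p_maj.pdf(x)]
    clean_maj = [x for x in majority if p_maj.pdf(x) >= p_min.pdf(x)]
    return clean_min, clean_maj

def group_minority(minority, majority, ratio=2.0, outlier_frac=0.05):
    """Label minority instances using densities refitted on cleaned data.
    The thresholds are assumptions made for illustration only."""
    p_min, p_maj = fit_density(minority), fit_density(majority)
    peak = p_min.pdf(p_min.mean)  # highest own-class density
    labels = []
    for x in minority:
        own, other = p_min.pdf(x), p_maj.pdf(x)
        if own < outlier_frac * peak:
            labels.append("outlier")    # very unlikely under its own class
        elif own >= ratio * other:
            labels.append("safety")     # clearly dominated by its own class
        else:
            labels.append("boundary")   # the two densities are comparable
    return labels

def smote_point(i, minority, k, rng):
    """SMOTE interpolation between instance i and one of its k nearest
    minority neighbours (1-D, so distance is the absolute difference)."""
    x = minority[i]
    others = [m for j, m in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda m: abs(m - x))[:k]
    chosen = rng.choice(neighbours)
    return x + rng.random() * (chosen - x)

def oversample(minority, labels, n_new, seed=0):
    """Generate n_new synthetic instances, using the K of each seed's group."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        synthetic.append(smote_point(i, minority, K_BY_GROUP[labels[i]], rng))
    return synthetic

# Toy data: 0.5 is a minority point inside the majority region,
# and 5.0 is a majority point inside the minority region.
minority = [4.0, 4.5, 5.0, 5.5, 6.0, 0.5]
majority = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 5.0]

clean_min, clean_maj = filter_noise(minority, majority)
labels = group_minority(clean_min, clean_maj)
print(clean_min, labels)
print(oversample(clean_min, labels, n_new=4))
```

On the toy data, the two misplaced points are the ones the density comparison discards, so no synthetic instance is interpolated toward them. A fuller version would presumably replace `fit_density` with a multi-component mixture fitted on multi-dimensional features.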
Pages: 170668-170681
Page count: 14