A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data

Authors
Zhang H. [1, 2]
Xiao H. [1, 3]
Yi C. [1, 3]
Yuan R. [1, 3]
Affiliations
[1] Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan
[2] Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan
[3] Precision Manufacturing Institute, Wuhan University of Science and Technology, Wuhan
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Deep Gaussian Mixture Model; Imbalanced Data; Over-Sampling
DOI
10.11925/infotech.2096-3467.2022.0609
Abstract
[Objective] This paper proposes a borderline over-sampling method based on the k-nearest neighbor (KNN) algorithm and the Deep Gaussian Mixture Model (DGMM) to reduce the classifier bias caused by imbalanced data. [Methods] First, we used the KNN algorithm to identify the borderline minority samples in the training set. Second, we constructed a DGMM for the minority samples. Next, we applied the DGMM in reverse to generate synthetic samples that conform to the distribution of the borderline minority samples. Finally, we used the three-sigma rule to remove noise samples, repeating the process until no outlier samples were generated. [Results] The proposed method improved AUC and G-mean by up to 8.62% and 12.99%, respectively, with average improvements of 3.51% and 4.93%. [Limitations] The parameter optimization method for the DGMM needs further improvement. [Conclusions] The proposed method can better address the problem of imbalanced data. © 2023 Data Analysis and Knowledge Discovery. All rights reserved.
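The steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the borderline criterion or the DGMM architecture, so the sketch assumes the "danger" rule from Borderline-SMOTE (at least half, but not all, of the k nearest neighbours are majority samples) and substitutes a single diagonal Gaussian for the DGMM. The function names `borderline_minority` and `oversample` are hypothetical.

```python
import math
import random
import statistics


def borderline_minority(X, y, minority_label, k=5):
    """Return minority samples whose k nearest neighbours are mostly,
    but not entirely, majority samples."""
    borderline = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Distance from sample i to every other sample, with that sample's label.
        others = [(math.dist(xi, xj), yj)
                  for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        others.sort(key=lambda pair: pair[0])
        n_majority = sum(1 for _, yj in others[:k] if yj != minority_label)
        # ASSUMPTION: Borderline-SMOTE "danger" rule, k/2 <= n_majority < k.
        if k / 2 <= n_majority < k:
            borderline.append(xi)
    return borderline


def oversample(borderline, n_new, rng):
    """Draw n_new synthetic samples, rejecting any that fall outside
    mean +/- 3 standard deviations in some feature."""
    if not borderline:
        raise ValueError("no borderline samples to model")
    columns = list(zip(*borderline))
    # ASSUMPTION: one diagonal Gaussian stands in for the DGMM.
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    accepted = []
    # Keep drawing until every retained sample passes the three-sigma check.
    while len(accepted) < n_new:
        candidate = [rng.gauss(m, s) for m, s in zip(means, stds)]
        if all(abs(v - m) <= 3 * s for v, m, s in zip(candidate, means, stds)):
            accepted.append(candidate)
    return accepted
```

Calling `oversample(borderline_minority(X, y, minority_label=1), n_new, random.Random(0))` returns `n_new` synthetic minority samples. The rejection loop is one reading of "repeated the process until no outlier samples were generated"; the paper may instead refit the model between rounds.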
Pages: 116-122
Page count: 6