A novel instance density-based hybrid resampling for imbalanced classification problems

Cited by: 1
Authors
Park, You-Jin [1]
Ma, Chung-Kang [1]
Affiliation
[1] Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei
Keywords
Class imbalance; Classification; Hybrid resampling; Instance density; Machine learning
DOI
10.1007/s00500-025-10499-x
Abstract
Class imbalance is one of the challenging issues in many machine learning applications. It occurs when the number of instances of one class is much smaller (or larger) than that of the other classes. Many approaches have been developed to handle imbalanced classification problems, for example the synthetic minority oversampling technique (SMOTE). However, SMOTE is often sensitive to the predetermined k value, i.e., the number of nearest neighbors used to generate the synthetic instances. For example, if the k value is moderately large, some of the synthetic instances generated by SMOTE may be located near a decision boundary or even within the majority class area, and these can be regarded as unnecessary noisy instances. In this study, we therefore propose an efficient hybrid resampling method based on instance density, called IDHR (Instance Density-based Hybrid Resampling), which improves classification performance by generating instances that are closer to the minority class than to the majority class while avoiding the generation of noisy instances. We first apply the instance density-based oversampling (IDO) technique to generate new synthetic instances. We then eliminate some of the synthetic instances that are close to the decision boundary. Among the retained synthetic instances, the number that can be further eliminated is determined from the maximum of the distances from all the synthetic instances to the minority class instances, the minimum of the distances from all the synthetic instances to the majority class instances, and the classification performance. To demonstrate the effectiveness of the proposed resampling method, comprehensive experiments are conducted on sixteen imbalanced datasets using three classifiers: the C4.5 decision tree algorithm, support vector machine (SVM), and multi-layer perceptron neural network (MLP-NN). The experimental analysis shows that the proposed resampling method outperforms the traditional oversampling methods with respect to AUC and F-measure for most of the imbalanced datasets, regardless of the classifier.
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
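The oversample-then-filter idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' IDHR: the abstract does not specify the IDO generation step or the exact elimination rule, so the sketch swaps in plain SMOTE-style interpolation for generation and a simple nearest-distance test for filtering. The function names `smote_like_oversample` and `drop_boundary_synthetics`, and all parameter values, are assumptions made for illustration only.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k, rng):
    """SMOTE-style generation (assumed stand-in for IDO): interpolate between
    a minority instance and one of its k nearest minority neighbours.
    Requires 1 <= k <= len(X_min) - 1."""
    # Pairwise Euclidean distances among minority instances.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    # Pick a random base instance and one of its k neighbours per new point.
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def drop_boundary_synthetics(X_syn, X_min, X_maj):
    """Hypothetical filter: keep a synthetic point only if its nearest
    minority instance is strictly closer than its nearest majority instance."""
    d_min = np.linalg.norm(X_syn[:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    d_maj = np.linalg.norm(X_syn[:, None, :] - X_maj[None, :, :], axis=2).min(axis=1)
    return X_syn[d_min < d_maj]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_maj = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # majority class
    X_min = rng.normal(loc=2.0, scale=1.0, size=(20, 2))   # minority class
    X_syn = smote_like_oversample(X_min, n_new=100, k=5, rng=rng)
    X_kept = drop_boundary_synthetics(X_syn, X_min, X_maj)
    print(f"generated {len(X_syn)} synthetic points, kept {len(X_kept)}")
```

Raising `k` in this sketch lets interpolation reach more distant minority neighbours, which is the situation the abstract describes as producing synthetic instances near or inside the majority region; the filter then discards the points that fail the distance test.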
Pages: 2031-2045
Page count: 14