An oversampling method for imbalanced dataset based on sparsity and boundary degree

Cited by: 0
Authors
Zhen Xue [1]
Yan Gao [2]
Liangliang Zhang [1]
Xu Yang [1]
Jianzhen Wu [1]
Affiliations
[1] North University of China, School of Mathematics
[2] Beijing University of Technology, Science Department
Keywords
Machine learning; Imbalanced dataset; Oversampling; HRSB-SMOTE; Sparsity; Boundary degree
DOI
10.1007/s11042-024-19767-8
Abstract
To improve the classification accuracy of minority samples in imbalanced datasets, we propose a novel oversampling method, HRSB-SMOTE (HDBSCAN-Ratio-Sparsity-Boundary-SMOTE), which builds on HDBSCAN clustering and SMOTE and incorporates the cluster ratio, sparsity, and boundary degree of minority samples. First, we apply HDBSCAN clustering to the minority samples and remove noisy samples according to their grade of membership. Then, based on the cluster ratio and the sparsity of the minority samples in each cluster, we determine the number of synthetic samples needed for that cluster. Next, we determine the number of synthetic samples needed for each minority sample based on its boundary degree within its cluster. Finally, we synthesize minority samples with SMOTE according to these numbers. The proposed method preserves the distribution characteristics of the original data, reinforces the decision boundary, and avoids generating noisy samples. Experimental results on 13 real-world datasets from the UCI Repository show that HRSB-SMOTE outperforms six other popular oversampling methods (such as SMOTE, Borderline-SMOTE, and ADASYN) in terms of F-measure, G-mean, and accuracy (Acc) on most datasets. Compared with SMOTE, Borderline-SMOTE, and k-means-SMOTE, HRSB-SMOTE improves the F-measure on the winequality-red-8vs6 dataset, which has a higher imbalance ratio (IR), by 10.79%, 1.16%, and 7.89%, respectively, when a KNN classifier is used. HRSB-SMOTE thus effectively handles both between-class and within-class imbalance.
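The last three steps of the abstract (a budget per cluster, a budget per sample, then SMOTE interpolation) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the formulas for sparsity or boundary degree, so the weights below are hypothetical placeholder numbers, the helper names `allocate` and `smote_interpolate` are invented for illustration, and the clusters are assumed to be already produced and denoised by HDBSCAN.

```python
import math
import random


def allocate(total, weights):
    """Split `total` synthetic samples in proportion to non-negative weights.

    Uses largest-remainder rounding so the counts sum to `total`.
    """
    s = sum(weights)
    if s == 0:  # no information: fall back to an even split
        weights = [1.0] * len(weights)
        s = float(len(weights))
    raw = [total * w / s for w in weights]
    counts = [math.floor(r) for r in raw]
    # Give the leftover samples to the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: max(0, total - sum(counts))]:
        counts[i] += 1
    return counts


def smote_interpolate(points, counts, k=2, rng=None):
    """Create counts[i] synthetic points from points[i], SMOTE-style.

    Each synthetic point lies on the segment between a sample and one of its
    k nearest neighbours in the same cluster: x + lam * (neighbour - x).
    """
    rng = rng or random.Random(0)
    synthetic = []
    for i, (x, n) in enumerate(zip(points, counts)):
        others = [p for j, p in enumerate(points) if j != i]
        if not others:  # a singleton cluster has no neighbour to interpolate to
            continue
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n):
            nb = rng.choice(neighbours)
            lam = rng.random()  # lam in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority clusters, assumed to come from HDBSCAN after noise removal.
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(5.0, 5.0), (6.0, 5.0)],
]
# Hypothetical cluster weights standing in for "cluster ratio and sparsity".
cluster_weights = [0.25, 0.75]
# Hypothetical per-sample weights standing in for "boundary degree".
boundary_weights = [[1.0, 1.0, 2.0], [1.0, 2.0]]

per_cluster = allocate(8, cluster_weights)
synthetic = []
for pts, budget, bw in zip(clusters, per_cluster, boundary_weights):
    per_sample = allocate(budget, bw)
    synthetic.extend(smote_interpolate(pts, per_sample))
```

Because every synthetic point is a convex combination of two samples from the same cluster, it stays inside that cluster's extent, which is one way to read the claim that the method avoids generating noisy samples.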
Pages: 17361-17387
Page count: 26