Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Cited by: 0
Authors
Wang, Fei [1]
Zheng, Ming [1, 2]
Ma, Kai [1]
Hu, Xiaowen [1]
Affiliations
[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China
[2] Anhui Normal Univ, Anhui Prov Key Lab Ind Intelligence Data Secur, Wuhu 241002, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced datasets; Resampling; Classification; Class instance density; SMOTE;
DOI
10.1016/j.ins.2024.121570
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches do not adequately account for the varying contributions of individual features to the classification model. To address this shortcoming, this study introduces three novel resampling approaches. The first, Oversampling based on class instance density per feature value intervals (OCF), augments the dataset. The second, Undersampling based on class instance density per feature value intervals (UCF), reduces the dataset size. The third, Hybrid sampling based on class instance density per feature value intervals (HSCF), performs oversampling and undersampling simultaneously. These approaches partition feature values into intervals based on their varying information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. The generated feature values are then combined to synthesize minority class data, which achieves oversampling. In addition, the study combines class instance density and feature importance to identify the majority class data at the classification boundary that contribute least, and removes them to perform undersampling. Adjustable sampling ratios and the integration of OCF and UCF enable hybrid sampling. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. Furthermore, the proposed method is observed to enhance the feature-splitting capability of decision tree classifiers. Hence, the best results, and the most significant improvements in classification performance, are achieved when it is used together with decision tree classifiers.
All code has been published at https://github.com/Wangfeiopen/HSCF.
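The oversampling idea described in the abstract (split each feature into value intervals, measure class instance density per interval, draw new feature values from intervals that favour the minority class, then combine them into synthetic instances) can be illustrated with a rough sketch. This is not the authors' OCF implementation; the published repository is the reference for that. The function name `interval_density_oversample`, the equal-width intervals, the weighting rule, and the independent per-feature sampling are all simplifying assumptions made for illustration only.

```python
import numpy as np


def interval_density_oversample(X, y, minority_label, n_new, n_bins=5, seed=0):
    """Toy per-feature interval-density oversampler (illustrative assumption,
    not the OCF algorithm from the paper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    if not is_min.any():
        raise ValueError("no minority instances found")

    synthetic = np.empty((n_new, X.shape[1]))
    for j in range(X.shape[1]):
        col = X[:, j]
        # Assumption: equal-width intervals over the observed range of feature j.
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        if edges[0] == edges[-1]:
            # Constant feature: nothing to sample, copy the single value.
            synthetic[:, j] = edges[0]
            continue
        total, _ = np.histogram(col, bins=edges)
        minority, _ = np.histogram(col[is_min], bins=edges)
        # Class instance density: share of minority instances in each interval.
        density = np.divide(minority, total,
                            out=np.zeros(n_bins), where=total > 0)
        # Assumption: favour intervals that the minority class both populates
        # and dominates.
        weights = density * minority
        probs = weights / weights.sum()
        chosen = rng.choice(n_bins, size=n_new, p=probs)
        # Draw each new value uniformly inside its chosen interval.
        synthetic[:, j] = rng.uniform(edges[chosen], edges[chosen + 1])
    return synthetic
```

Because each feature is sampled independently here, correlations between features are ignored; how the paper combines generated feature values into instances is not specified in the abstract, so this step in particular should be read as a placeholder.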
Pages: 44