Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Authors
Wang, Fei [1]
Zheng, Ming [1, 2]
Ma, Kai [1]
Hu, Xiaowen [1]
Affiliations
[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China
[2] Anhui Normal Univ, Anhui Prov Key Lab Ind Intelligence Data Secur, Wuhu 241002, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced datasets; Resampling; Classification; Class instance density; SMOTE
DOI
10.1016/j.ins.2024.121570
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. Most conventional resampling approaches, however, do not adequately account for the varying contributions of individual features to the classification model. To address this shortcoming, this study introduces three resampling approaches. The first, Oversampling based on class instance density per feature value intervals (OCF), augments the dataset. The second, Undersampling based on class instance density per feature value intervals (UCF), reduces the dataset size. The third, Hybrid sampling based on class instance density per feature value intervals (HSCF), performs oversampling and undersampling simultaneously. For oversampling, the approaches partition each feature's values into intervals according to their information content, calculate the class instance densities within these intervals, and generate feature values in the intervals that carry highly discriminative information. The generated feature values are then combined to synthesize minority class data. For undersampling, the study combines class instance density with feature importance to identify the majority class data at the classification boundary that contribute least, and removes them. Because the sampling ratios are adjustable, OCF and UCF can be integrated to implement hybrid sampling. Experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. The method is also observed to enhance the feature-splitting capability of decision tree classifiers, so it achieves its best results, and its largest improvements in classification performance, when paired with decision tree classifiers.
All code has been published at https://github.com/Wangfeiopen/HSCF.
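The oversampling idea summarized in the abstract (split a feature into intervals, measure class instance density per interval, then generate values where the minority class is dense) can be illustrated with a rough sketch. This is not the authors' implementation, which lives in the repository above. It assumes equal-width intervals, defines density as the minority fraction of instances in an interval, and draws new values uniformly inside the chosen interval; the function names `interval_class_density` and `sample_feature_values` are hypothetical.

```python
import numpy as np


def interval_class_density(values, labels, minority_label, n_bins=5):
    """Return bin edges and the minority fraction of instances in each bin.

    Assumption: equal-width intervals; the paper may partition differently.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1.
    idx = np.digitize(values, edges[1:-1])
    total = np.bincount(idx, minlength=n_bins)
    minority = np.bincount(idx[labels == minority_label], minlength=n_bins)
    # Empty bins get density 0 instead of a division-by-zero warning.
    density = np.divide(minority, total, out=np.zeros(n_bins), where=total > 0)
    return edges, density


def sample_feature_values(edges, density, n_samples, rng):
    """Draw values from intervals picked with probability proportional to density."""
    if density.sum() == 0:
        raise ValueError("no interval contains minority instances")
    probs = density / density.sum()
    bins = rng.choice(len(density), size=n_samples, p=probs)
    # Uniform draw inside each selected interval (an illustrative choice).
    return rng.uniform(edges[bins], edges[bins + 1])


# Toy single-feature data: the minority class (label 1) sits at high values.
x = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
edges, density = interval_class_density(x, y, minority_label=1, n_bins=2)
new_values = sample_feature_values(edges, density, 4, np.random.default_rng(0))
```

In this toy case only the upper interval contains minority instances, so every generated value falls inside it. Per the abstract, a full synthetic minority instance would combine values generated this way across all features.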
Pages: 44