Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Cited by: 0
Authors
Wang, Fei [1]
Zheng, Ming [1, 2]
Ma, Kai [1]
Hu, Xiaowen [1]
Affiliations
[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China
[2] Anhui Normal Univ, Anhui Prov Key Lab Ind Intelligence Data Secur, Wuhu 241002, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced datasets; Resampling; Classification; Class instance density; SMOTE
DOI
10.1016/j.ins.2024.121570
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches do not adequately account for the varying contributions of individual features to the classification model. To address this shortcoming, this study introduces three novel resampling approaches. The first, Oversampling based on class instance density per feature value intervals (OCF), augments the dataset. The second, Undersampling based on class instance density per feature value intervals (UCF), reduces the dataset size. The third, Hybrid sampling based on class instance density per feature value intervals (HSCF), performs oversampling and undersampling simultaneously. These approaches partition feature values into intervals according to their information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. The generated feature values are then combined to synthesize minority class data, thereby achieving oversampling. In addition, the study combines class instance density and feature importance to identify the majority class data at the classification boundary that contribute least, and removes them to achieve undersampling. Adjustable sampling ratios and the integration of OCF and UCF enable hybrid sampling. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. Furthermore, the proposed method is observed to enhance the feature-splitting capability of decision tree classifiers. Consequently, the best results, with the most significant improvements in classification performance, are achieved in combination with decision tree classifiers.
All code has been published at https://github.com/Wangfeiopen/HSCF.
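
The oversampling idea summarized in the abstract (split each feature into value intervals, measure class instance density per interval, draw new feature values from intervals dominated by the minority class, then combine them into synthetic instances) can be illustrated with a rough sketch. This is not the authors' OCF implementation, which is available in the repository above. The equal-width binning, the use of the minority share as the density measure, the uniform draw within an interval, and the names `interval_class_density` and `oversample_by_interval_density` are all assumptions made for illustration.

```python
import numpy as np


def interval_class_density(x, y, minority_label, n_bins=5):
    """Return equal-width bin edges and the minority share of instances per bin.

    Assumption: equal-width intervals and "minority count / total count" as
    the density measure. The feature must not be constant.
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1.
    idx = np.digitize(x, edges[1:-1])
    total = np.bincount(idx, minlength=n_bins)
    minority = np.bincount(idx[y == minority_label], minlength=n_bins)
    density = np.divide(minority, total, out=np.zeros(n_bins), where=total > 0)
    return edges, density


def oversample_by_interval_density(X, y, minority_label, n_new, n_bins=5, seed=0):
    """Append n_new synthetic minority rows built feature by feature.

    Hypothetical helper: each feature value is drawn uniformly from an
    interval chosen with probability proportional to its minority density,
    and the per-feature values are combined into full instances.
    """
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X.shape[1]))
    for j in range(X.shape[1]):
        edges, density = interval_class_density(X[:, j], y, minority_label, n_bins)
        probs = density / density.sum()
        bins = rng.choice(n_bins, size=n_new, p=probs)
        synthetic[:, j] = rng.uniform(edges[bins], edges[bins + 1])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out
```

Sampling each feature independently ignores correlations between features, so a sketch like this is only a starting point for understanding the interval-density idea, not a substitute for the published method.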
Pages: 44