Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Cited by: 0

Authors:

Wang, Fei [1]

Zheng, Ming [1,2]

Ma, Kai [1]

Hu, Xiaowen [1]

Affiliations:

[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China

[2] Anhui Normal Univ, Anhui Prov Key Lab Ind Intelligence Data Secur, Wuhu 241002, Anhui, Peoples R China

Source:

INFORMATION SCIENCES | 2025 / Vol. 692

Funding:

National Natural Science Foundation of China;

Keywords:

Imbalanced datasets; Resampling; Classification; Class instance density; SMOTE;

DOI:

10.1016/j.ins.2024.121570

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches do not adequately account for the varying contributions of individual features to the classification model. To address this shortcoming, this study introduces three novel resampling approaches. The first, Oversampling based on class instance density per feature value intervals (OCF), augments the dataset. The second, Undersampling based on class instance density per feature value intervals (UCF), reduces the dataset size. The third, Hybrid sampling based on class instance density per feature value intervals (HSCF), performs oversampling and undersampling simultaneously. These approaches partition the values of each feature into intervals according to their information content, calculate the class instance densities within these intervals, and generate feature values in the intervals that carry highly discriminative information. The generated feature values are then combined to synthesize minority class data, which achieves oversampling. For undersampling, the study combines class instance density and feature importance to identify the majority class data at the classification boundary that contribute least, and removes them. Because the sampling ratios are adjustable and OCF and UCF can be integrated, hybrid sampling follows directly. Experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. The proposed method is also observed to enhance the feature-splitting capability of decision tree classifiers, so the largest improvements in classification performance are obtained when it is combined with decision tree classifiers.
All code has been published at https://github.com/Wangfeiopen/HSCF.
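The oversampling idea in the abstract (split each feature into intervals, measure class instance density per interval, draw new feature values from the intervals that favor the minority class, then combine them into synthetic instances) can be illustrated with a minimal sketch. This is not the authors' OCF implementation, which is available in their repository. The function name `interval_density_oversample`, the equal-width intervals, the density definition (minority share of the instances in an interval), and all parameters are assumptions made for illustration only.

```python
import numpy as np


def interval_density_oversample(X, y, minority=1, n_new=10, n_bins=5, seed=0):
    """Hypothetical sketch of interval-density oversampling (not the paper's OCF)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority]
    if len(X_min) == 0:
        raise ValueError("no minority instances found")

    synthetic = np.empty((n_new, X.shape[1]))
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            # Constant feature: nothing to partition, so copy the single value.
            synthetic[:, j] = lo
            continue
        # Assumption: equal-width intervals over the observed range of feature j.
        edges = np.linspace(lo, hi, n_bins + 1)
        total, _ = np.histogram(X[:, j], bins=edges)
        minority_count, _ = np.histogram(X_min[:, j], bins=edges)
        # Assumed density: fraction of the instances in each interval that are minority.
        density = np.divide(
            minority_count, total, out=np.zeros(n_bins), where=total > 0
        )
        probs = density / density.sum()
        # Pick an interval in proportion to its minority density,
        # then draw a value uniformly inside that interval.
        chosen = rng.choice(n_bins, size=n_new, p=probs)
        synthetic[:, j] = rng.uniform(edges[chosen], edges[chosen + 1])
    return synthetic
```

In this sketch each feature is sampled independently and the columns are then stacked into synthetic minority rows, which mirrors the "generate feature values, then combine them" description only loosely. The paper's actual interval construction, density measure, and use of feature importance may differ.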

Pages: 44

50 items in total

[31] A membership-based resampling and cleaning algorithm for multi-class imbalanced overlapping data
Ma, Tingting
Lu, Shuxia
Jiang, Chen
EXPERT SYSTEMS WITH APPLICATIONS, 2024, 240
[32] dSubSign: Classification of Instance-Feature Data Using Discriminative Subgraphs as Class Signatures
Paranjape, Parnika N.
Dhabu, Meera M.
Deshpande, Parag S.
INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2021, 31 (07) : 917 - 947
[33] An effective distance based feature selection approach for imbalanced data
Shaukat Ali Shahee
Usha Ananthakumar
Applied Intelligence, 2020, 50 : 717 - 745
[34] An effective distance based feature selection approach for imbalanced data
Shahee, Shaukat Ali
Ananthakumar, Usha
APPLIED INTELLIGENCE, 2020, 50 (03) : 717 - 745
[35] A Classification Algorithm Based on Ensemble Feature Selections for Imbalanced-Class Dataset
Yin, Hua
Gai, Keke
Wang, Zhijian
2016 IEEE 2ND INTERNATIONAL CONFERENCE ON BIG DATA SECURITY ON CLOUD (BIGDATASECURITY), IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING (HPSC), AND IEEE INTERNATIONAL CONFERENCE ON INTELLIGENT DATA AND SECURITY (IDS), 2016, : 245 - 249
[36] Deep Learning with MCA-based Instance Selection and Bootstrapping for Imbalanced Data Classification
Guan, Sheng
Chen, Min
Ha, Hsin-Yu
Chen, Shu-Ching
Shyu, Mei-Ling
Zhang, Chengde
2015 IEEE CONFERENCE ON COLLABORATION AND INTERNET COMPUTING (CIC), 2015, : 288 - 295
[37] A weighted pattern matching approach for classification of imbalanced data with a fireworks-based algorithm for feature selection
Sreeja, N. K.
CONNECTION SCIENCE, 2019, 31 (02) : 143 - 168
[38] A multi-manifold learning based instance weighting and under-sampling for imbalanced data classification problems
Feizi, Tayyebe
Moattar, Mohammad Hossein
Tabatabaee, Hamid
JOURNAL OF BIG DATA, 2023, 10 (01)
[39] A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets
Han, Xu
Cui, Runbang
Lan, Yanfei
Kang, Yanzhe
Deng, Jiang
Jia, Ning
INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2019, 10 (12) : 3687 - 3699
[40] A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data
Yu, Lean
Zhou, Rongtian
Tang, Ling
Chen, Rongda
APPLIED SOFT COMPUTING, 2018, 69 : 192 - 202