Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Cited by: 0

Authors:

Wang, Fei [1]

Zheng, Ming [1,2]

Ma, Kai [1]

Hu, Xiaowen [1]

Affiliations:

[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China

[2] Anhui Normal Univ, Anhui Prov Key Lab Ind Intelligence Data Secur, Wuhu 241002, Anhui, Peoples R China

Source:

INFORMATION SCIENCES | 2025 / Vol. 692

Funding:

National Natural Science Foundation of China;

Keywords:

Imbalanced datasets; Resampling; Classification; Class instance density; SMOTE;

DOI:

10.1016/j.ins.2024.121570

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches do not adequately account for the varying contributions of individual features to the classification model. To address this shortcoming, this study introduces three novel resampling approaches. The first, Oversampling based on class instance density per feature value intervals (OCF), augments the dataset. The second, Undersampling based on class instance density per feature value intervals (UCF), reduces the dataset size. The third, Hybrid sampling based on class instance density per feature value intervals (HSCF), performs oversampling and undersampling simultaneously. These approaches partition feature values into intervals according to their information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. The generated feature values are then combined to synthesize minority class data, which achieves oversampling. In addition, the study combines class instance density and feature importance to identify majority class data at the classification boundary that contribute least, and removes them to perform undersampling. Adjustable sampling ratios and the integration of OCF and UCF make hybrid sampling possible. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. The proposed method is also observed to enhance the feature-dividing capability of decision tree classifiers; consequently, it achieves its best results, and the largest improvements in classification performance, when paired with decision tree classifiers.
All code is available at https://github.com/Wangfeiopen/HSCF.
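
As a rough illustration of the oversampling idea described in the abstract, the Python sketch below splits each feature into intervals, measures the minority-class density of each interval, and draws synthetic feature values from intervals where that density is high. It is not the authors' OCF implementation (the repository above holds that). The equal-width intervals, the density definition (minority share of the instances in an interval), the independent per-feature sampling, and the names `interval_densities` and `oversample` are all assumptions made here for illustration.

```python
import random
from collections import Counter


def interval_densities(values, labels, n_bins, minority):
    """Return equal-width intervals of one feature and the minority share in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    total, minor = Counter(), Counter()
    for v, label in zip(values, labels):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        total[b] += 1
        minor[b] += int(label == minority)
    edges = [(lo + b * width, lo + (b + 1) * width) for b in range(n_bins)]
    dens = [minor[b] / total[b] if total[b] else 0.0 for b in range(n_bins)]
    return edges, dens


def oversample(X, y, minority, n_new, n_bins=5, seed=0):
    """Synthesize n_new minority rows, one feature at a time (assumed scheme).

    Requires at least one minority instance, otherwise all weights are zero.
    """
    rng = random.Random(seed)
    per_feature = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        per_feature.append(interval_densities(column, y, n_bins, minority))

    synthetic = []
    for _ in range(n_new):
        row = []
        for edges, dens in per_feature:
            # Pick an interval with probability proportional to its minority density,
            # then draw a value uniformly inside that interval.
            a, b = rng.choices(edges, weights=dens, k=1)[0]
            row.append(rng.uniform(a, b))
        synthetic.append(row)
    return synthetic


if __name__ == "__main__":
    X = [[0.0, 5.0], [0.5, 5.5], [9.0, 1.0], [9.5, 0.5], [10.0, 0.0], [8.5, 1.5]]
    y = [1, 1, 0, 0, 0, 0]
    for row in oversample(X, y, minority=1, n_new=3):
        print(row)
```

Sampling each feature independently ignores correlations between features, so this sketch only conveys the interval-density idea; it does not reproduce the undersampling step or the feature-importance weighting that the abstract mentions.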


Pages: 44
