An empirical study on the joint impact of feature selection and data resampling on imbalance classification

Cited by: 28
Authors
Zhang, Chongsheng [1]
Soda, Paolo [2, 3]
Bi, Jingjun [1]
Fan, Gaojuan [1]
Almpanidis, George [1]
Garcia, Salvador [4]
Ding, Weiping [5]
Affiliations
[1] Henan Univ, Henan Key Lab Big Data Anal & Proc, Kaifeng, Henan, Peoples R China
[2] Univ Campus Biomed Rome, Dept Engn, Rome, Italy
[3] Umea Univ, Dept Radiat Sci, Biomed Engn, Radiat Phys, Umea, Sweden
[4] Univ Granada, DaSCI Andalusian Res Inst, Granada, Spain
[5] Nantong Univ, Sch Informat Sci & Technol, Nantong, Peoples R China
Keywords
Imbalanced classification; Feature selection; Data selection; Resampling; SMOTE
DOI
10.1007/s10489-022-03772-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many real-world datasets exhibit imbalanced distributions, in which the majority classes have sufficient samples, whereas the minority classes often have a very small number of samples. Data resampling has proven to be effective in alleviating such imbalanced settings, while feature selection is a commonly used technique for improving classification performance. However, the joint impact of feature selection and data resampling on two-class imbalance classification has rarely been addressed before. This work investigates the performance of two opposite imbalanced classification frameworks in which feature selection is applied before or after data resampling. We conduct a large-scale empirical study with a total of 9225 experiments on 52 publicly available datasets. The results show that both frameworks should be considered for finding the best performing imbalanced classification model. We also study the impact of classifiers, the ratio between the number of majority and minority samples (IR), and the ratio between the number of samples and features (SFR) on the performance of imbalance classification. Overall, this work provides a new reference value for researchers and practitioners in imbalance learning.
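The two orderings compared in the abstract, and the IR and SFR quantities it names, can be illustrated with a minimal sketch. This is not the authors' code: the toy data, the mean-gap feature filter, and the random oversampler (used here as a simple stand-in for SMOTE-style resampling) are all assumptions made for illustration, and every function name is hypothetical.

```python
import random
from collections import Counter

def imbalance_ratio(y):
    """IR: majority-class count divided by minority-class count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

def sample_feature_ratio(X):
    """SFR: number of samples divided by number of features."""
    return len(X) / len(X[0])

def select_features(X, y, k):
    """Toy filter (assumption): keep the k features with the largest
    absolute gap between the two class means. Two-class labels only."""
    classes = sorted(set(y))
    def gap(j):
        means = []
        for c in classes:
            vals = [row[j] for row, label in zip(X, y) if label == c]
            means.append(sum(vals) / len(vals))
        return abs(means[0] - means[1])
    ranked = sorted(range(len(X[0])), key=gap, reverse=True)
    return sorted(ranked[:k])

def random_oversample(X, y, seed=0):
    """Stand-in resampler (assumption): duplicate random minority
    samples until both classes have the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    pool = [row for row, label in zip(X, y) if label == minority]
    n_extra = max(counts.values()) - counts[minority]
    extra = [rng.choice(pool) for _ in range(n_extra)]
    return X + extra, y + [minority] * n_extra

def fs_then_resample(X, y, k):
    """Framework A: feature selection first, then resampling."""
    cols = select_features(X, y, k)
    X_sel = [[row[j] for j in cols] for row in X]
    X_res, y_res = random_oversample(X_sel, y)
    return X_res, y_res, cols

def resample_then_fs(X, y, k):
    """Framework B: resampling first, then feature selection."""
    X_res, y_res = random_oversample(X, y)
    cols = select_features(X_res, y_res, k)
    return [[row[j] for j in cols] for row in X_res], y_res, cols

# Hypothetical toy data: 4 majority samples (label 0), 2 minority (label 1).
X = [[0.1, 5.0, 1.0], [0.2, 5.1, 0.9], [0.0, 4.9, 1.1],
     [0.1, 5.2, 1.0], [0.9, 5.0, 1.0], [1.0, 5.1, 1.2]]
y = [0, 0, 0, 0, 1, 1]

print("IR:", imbalance_ratio(y), "SFR:", sample_feature_ratio(X))
for pipeline in (fs_then_resample, resample_then_fs):
    X_out, y_out, cols = pipeline(X, y, k=1)
    print(pipeline.__name__, "kept features", cols,
          "IR after resampling:", imbalance_ratio(y_out))
```

On data this small the two orderings pick the same feature. In general they can differ, because in the second ordering the selector scores features on the resampled class distribution. In an actual evaluation, both steps would normally be fitted on the training split only and followed by a classifier.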
Pages: 5449-5461
Page count: 13
Related papers
50 in total
  • [31] Resampling approach for imbalanced data classification based on class instance density per feature value intervals
    Wang, Fei
    Zheng, Ming
    Ma, Kai
    Hu, Xiaowen
    INFORMATION SCIENCES, 2025, 692
  • [32] A Projected Feature Selection Algorithm for Data Classification
    Yin, Zhiwu
    Huang, Shangteng
    2007 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-15, 2007: 3665-3668
  • [33] Feature Selection for Classification of Hyperspectral Data by SVM
    Pal, Mahesh
    Foody, Giles M.
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2010, 48 (05): 2297-2307
  • [34] Feature Selection in Clinical Data Processing For Classification
    Seethal, C. R.
    Panicker, Janu R.
    Vasudevan, Veena
    PROCEEDINGS OF 2016 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE (ICIS), 2016: 172-175
  • [35] Bagging and Feature Selection for Classification with Incomplete Data
    Cao Truong Tran
    Zhang, Mengjie
    Andreae, Peter
    Xue, Bing
    APPLICATIONS OF EVOLUTIONARY COMPUTATION, EVOAPPLICATIONS 2017, PT I, 2017, 10199: 471-486
  • [36] A joint multiobjective optimization of feature selection and classifier design for high-dimensional data classification
    Bai, Lixia
    Li, Hong
    Gao, Weifeng
    Xie, Jin
    Wang, Houqiang
    INFORMATION SCIENCES, 2023, 626: 457-473
  • [37] A Study of Feature Selection Approaches for Classification
    Banu, A. K. Shafreen
    Ganesh, S. Hari
    2015 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION, EMBEDDED AND COMMUNICATION SYSTEMS (ICIIECS), 2015
  • [38] Feature Selection for EEG Data Classification with Weka
    Murtazina, Marina
    Avdeenko, Tatiana
    ADVANCES IN SWARM INTELLIGENCE, ICSI 2022, PT II, 2022: 279-288
  • [39] Sample imbalance disease classification model based on association rule feature selection
    Huang, Chenxi
    Huang, Xin
    Fang, Yu
    Xu, Jianfeng
    Qu, Yi
    Zhai, Pengjun
    Fan, Lin
    Yin, Hua
    Xu, Yilu
    Li, Jiahang
    PATTERN RECOGNITION LETTERS, 2020, 133 (133): 280-286
  • [40] Selection of Augmented Data for Overcoming the Imbalance Problem in Facies Classification
    Kim, Dowan
    Byun, Joongmoo
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19