Comprehensive empirical investigation for prioritizing the pipeline of using feature selection and data resampling techniques

Cited by: 0
Authors
Tyagi, Pooja [1]
Singh, Jaspreeti [1]
Gosain, Anjana [1]
Affiliations
[1] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, New Delhi, India
Keywords
Imbalanced data; feature selection; machine learning; oversampling; undersampling; CLASS-IMBALANCED DATASETS; CLASSIFICATION METHOD; PREDICTION; SMOTE; CLASSIFIERS; TESTS;
DOI
10.3233/JIFS-233511
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Contemporary real-world datasets often suffer from both class imbalance and high dimensionality. Data resampling is commonly used to combat class imbalance, whereas feature selection is used to tackle high dimensionality. These problems have been studied extensively as independent problems in the literature, but the possible synergy between them is still unclear. This paper studies the effect of addressing both issues together by combining resampling and feature selection techniques for binary imbalanced classification. In particular, the primary goal of the study is to prioritize the order, or pipeline, in which these techniques are applied, and to analyze the performance of the two opposite pipelines: feature selection before resampling (F + S) and resampling before feature selection (S + F). To this end, a comprehensive empirical study is carried out, comprising a total of 34,560 tests on 30 publicly available datasets, using combinations of 12 resampling techniques for class imbalance and 12 feature selection methods, and evaluating performance with 4 different classifiers. The experiments show that neither pipeline is consistently better than the other, and that both should be considered to obtain the best classification results on high-dimensional imbalanced data. Additionally, when Decision Tree (DT) or Random Forest (RF) is used as the base learner, S + F predominates over F + S, whereas for Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases.
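The difference between the two pipeline orders can be illustrated with a minimal sketch. It is not the authors' implementation: random oversampling stands in for the 12 resampling techniques, a simple class-mean-difference filter stands in for the 12 feature selection methods, and all function names and the toy data are hypothetical. Only the Python standard library is assumed.

```python
import random
from statistics import mean

def select_features(X, y, k):
    """Stand-in filter: keep the k features whose class means differ most."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    scores = [abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def oversample(X, y, rng):
    """Stand-in resampler: duplicate random minority rows until classes balance."""
    minority = 1 if sum(y) * 2 < len(y) else 0
    min_rows = [row for row, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(min_rows)) - len(min_rows)
    extra = [rng.choice(min_rows) for _ in range(deficit)]
    return X + extra, y + [minority] * deficit

def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]

def pipeline_fs(X, y, k, rng):
    """F + S: select features on the original data, then resample."""
    cols = select_features(X, y, k)
    X_res, y_res = oversample(project(X, cols), y, rng)
    return cols, X_res, y_res

def pipeline_sf(X, y, k, rng):
    """S + F: resample first, then select features on the balanced data."""
    X_res, y_res = oversample(X, y, rng)
    cols = select_features(X_res, y_res, k)
    return cols, project(X_res, cols), y_res

# Hypothetical toy data: 40 samples, 5 features, 8 minority (label 1) samples.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [1] * 8 + [0] * 32
for row, label in zip(X, y):
    row[0] += 2.0 * label  # make feature 0 informative for the minority class

fs_cols, fs_X, fs_y = pipeline_fs(X, y, k=2, rng=random.Random(1))
sf_cols, sf_X, sf_y = pipeline_sf(X, y, k=2, rng=random.Random(1))
print("F + S selected features:", fs_cols)
print("S + F selected features:", sf_cols)
```

Both orders return a balanced training set with the same number of features, but the selector sees different data in each case, so the chosen features can differ. The reported total of 34,560 tests is consistent with 30 datasets × 12 × 12 technique combinations × 4 classifiers × 2 pipeline orders, although that breakdown is an inference and is not stated in the abstract.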
According to the mean rankings obtained from the Friedman test, the best combinations of resampling and feature selection techniques for DT, SVM, LR, and RF are, respectively, SMOTE + RFE (Synthetic Minority Oversampling Technique and Recursive Feature Elimination), LASSO (Least Absolute Shrinkage and Selection Operator) + SMOTE, SMOTE + embedded feature selection using RF, and SMOTE + RFE.
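As a rough illustration of how a Friedman mean ranking orders competing combinations, the sketch below ranks each combination within every dataset (rank 1 is best), averages the ranks, and computes the basic Friedman chi-square statistic without tie correction. The score table and function names are hypothetical and are not taken from the paper.

```python
def rank_row(scores):
    """Rank one dataset's scores: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    return ranks

def friedman_mean_ranks(table):
    """Return per-method mean ranks and the Friedman chi-square (no tie correction)."""
    n, k = len(table), len(table[0])
    ranks = [rank_row(row) for row in table]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(R * R for R in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2

# Hypothetical scores: rows are datasets, columns are three technique combinations.
scores = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.79],
    [0.90, 0.85, 0.80],
    [0.60, 0.65, 0.55],
]
mean_ranks, chi2 = friedman_mean_ranks(scores)
print(mean_ranks, chi2)
```

The combination with the lowest mean rank is the one a ranking of this kind would report as best for a given classifier.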
Pages: 6019-6040
Page count: 22