Comprehensive empirical investigation for prioritizing the pipeline of using feature selection and data resampling techniques

Cited by: 0
Authors
Tyagi, Pooja [1]
Singh, Jaspreeti [1]
Gosain, Anjana [1]
Affiliations
[1] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, New Delhi, India
Keywords
Imbalanced data; feature selection; machine learning; oversampling; undersampling; CLASS-IMBALANCED DATASETS; CLASSIFICATION METHOD; PREDICTION; SMOTE; CLASSIFIERS; TESTS
DOI
10.3233/JIFS-233511
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Contemporary real-world datasets often suffer from both class imbalance and high dimensionality. Data resampling is a common remedy for class imbalance, whereas feature selection is used to tackle high dimensionality. These two problems have been studied extensively as independent problems in the literature, but the possible synergy between them is still unclear. This paper studies the effect of addressing both issues together by combining resampling and feature selection techniques for binary imbalanced classification. In particular, the primary goal of this study is to prioritize the order, or pipeline, in which these techniques are applied, and to analyze the performance of the two opposite pipelines: feature selection before resampling (F + S) and resampling before feature selection (S + F). To this end, a comprehensive empirical study is carried out, comprising a total of 34,560 tests on 30 publicly available datasets, using combinations of 12 resampling techniques and 12 feature selection methods and evaluating performance with 4 different classifiers. The experiments show that neither pipeline is consistently better than the other, so both should be considered to obtain the best classification results on high-dimensional imbalanced data. Additionally, with Decision Tree (DT) or Random Forest (RF) as the base learner, S + F predominates over F + S, whereas with Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases. According to the mean rankings obtained from the Friedman test, the best combinations of resampling and feature selection techniques for DT, SVM, LR, and RF are, respectively, SMOTE + RFE (Synthetic Minority Oversampling Technique and Recursive Feature Elimination), LASSO (Least Absolute Shrinkage and Selection Operator) + SMOTE, SMOTE + embedded feature selection using RF, and SMOTE + RFE.
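To make the two orderings concrete, the following is a minimal sketch, not the authors' implementation. It uses only the Python standard library and deliberately swaps in simple stand-ins: plain random oversampling in place of SMOTE, and a class-mean-gap filter in place of RFE or LASSO. All function names and the toy data are hypothetical. The last line checks that the reported 34,560 tests are consistent with a full grid of 30 datasets, 12 resamplers, 12 selectors, 4 classifiers, and 2 pipeline orders. The abstract does not state this breakdown, so it is an assumption.

```python
import random
from statistics import mean


def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until both classes are the same size.
    (Stand-in for SMOTE, which would synthesize new points instead.)"""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(X, y):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    deficit = len(by_class[1 - minority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(X) + extra, list(y) + [minority] * deficit


def select_top_k(X, y, k):
    """Filter-style selection: keep the k features with the largest
    absolute gap between class means. (Stand-in for RFE, LASSO, etc.)"""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep


def pipeline_f_then_s(X, y, k):
    """F + S: select features on the imbalanced data, then resample."""
    X_sel, keep = select_top_k(X, y, k)
    X_bal, y_bal = random_oversample(X_sel, y)
    return X_bal, y_bal, keep


def pipeline_s_then_f(X, y, k):
    """S + F: resample first, then select features on the balanced data."""
    X_bal, y_bal = random_oversample(X, y)
    X_sel, keep = select_top_k(X_bal, y_bal, k)
    return X_sel, y_bal, keep


def make_toy_data(n_major=40, n_minor=8, n_features=6, seed=1):
    """Hypothetical imbalanced binary data; only feature 0 is informative."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((0, n_major), (1, n_minor)):
        for _ in range(n):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            row[0] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
X_fs, y_fs, keep_fs = pipeline_f_then_s(X, y, k=2)
X_sf, y_sf, keep_sf = pipeline_s_then_f(X, y, k=2)
print("F + S kept features:", keep_fs)
print("S + F kept features:", keep_sf)

# Assumed full-factorial grid: datasets x resamplers x selectors
# x classifiers x pipeline orders.
n_runs = 30 * 12 * 12 * 4 * 2
```

Both pipelines return a balanced, reduced dataset of the same shape. They can differ in which features are kept, because in S + F the selector scores features on the resampled distribution. In an actual evaluation, both steps would normally be fitted on the training folds only, so that no resampled or selection-related information leaks into the test data.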
Pages: 6019-6040
Page count: 22