Comprehensive empirical investigation for prioritizing the pipeline of using feature selection and data resampling techniques

Cited: 0
Authors
Tyagi, Pooja [1 ]
Singh, Jaspreeti [1 ]
Gosain, Anjana [1 ]
Affiliations
[1] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, New Delhi, India
Keywords
Imbalanced data; feature selection; machine learning; oversampling; undersampling; class-imbalanced datasets; classification method; prediction; SMOTE; classifiers; tests
DOI
10.3233/JIFS-233511
Chinese Library Classification (CLC) code
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Contemporary real-world datasets often suffer from both class imbalance and high dimensionality. Data resampling is a commonly used approach for combating class imbalance, whereas feature selection is used for tackling high dimensionality. Both problems have been studied extensively in the literature, but largely as independent issues, and the possible synergy between them is still unclear. This paper studies the effects of addressing the two issues in conjunction, using a combination of resampling and feature selection techniques for binary-class imbalanced classification. In particular, the primary goal of this study is to prioritize the sequence, or pipeline, in which these techniques are applied, and to analyze the performance of the two opposite pipelines that apply feature selection before or after resampling, i.e., F + S or S + F. To this end, a comprehensive empirical study is carried out by conducting a total of 34,560 tests on 30 publicly available datasets, combining 12 resampling techniques for class imbalance with 12 feature selection methods and evaluating the performance on 4 different classifiers. The experiments show that neither pipeline is consistently better than the other, and both should be considered when seeking the best classification results on high-dimensional imbalanced data. Additionally, when Decision Tree (DT) or Random Forest (RF) is used as the base learner, S + F predominates over F + S, whereas for Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases. According to the mean rankings obtained from the Friedman test, the best combinations of resampling and feature selection techniques for DT, SVM, LR and RF are SMOTE + RFE (Synthetic Minority Oversampling Technique and Recursive Feature Elimination), LASSO (Least Absolute Shrinkage and Selection Operator) + SMOTE, SMOTE + embedded feature selection using RF, and SMOTE + RFE, respectively.
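The two pipeline orders compared in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental code; it assumes scikit-learn and imbalanced-learn implementations of RFE and SMOTE, a Decision Tree base learner, and a synthetic toy dataset standing in for the 30 benchmark datasets used in the paper.

```python
# Minimal sketch of the F + S and S + F pipelines (assumed libraries:
# scikit-learn and imbalanced-learn; toy data in place of the paper's datasets).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced, moderately high-dimensional toy data (90% / 10% class split).
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# F + S: feature selection (RFE) first, then resampling (SMOTE).
f_plus_s = ImbPipeline([
    ("select", RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=10)),
    ("resample", SMOTE(random_state=42)),
    ("clf", clf),
])

# S + F: resampling (SMOTE) first, then feature selection (RFE).
s_plus_f = ImbPipeline([
    ("resample", SMOTE(random_state=42)),
    ("select", RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=10)),
    ("clf", clf),
])

# Compare the two orderings with cross-validated ROC AUC; the imblearn
# pipeline applies SMOTE only to the training folds, never to the test folds.
for name, pipe in [("F + S", f_plus_s), ("S + F", s_plus_f)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```

In the paper's setting, the same comparison is repeated over 12 resampling techniques, 12 feature selection methods, 4 classifiers and 30 datasets, which is where the total of 34,560 tests comes from.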
Pages: 6019-6040
Number of pages: 22