Feature selection and its combination with data over-sampling for multi-class imbalanced datasets

Cited by: 22
Authors
Tsai, Chih-Fong [1]
Chen, Kuan-Chen [1]
Lin, Wei-Chao [1,2,3,4]
Affiliations
[1] Natl Cent Univ, Dept Informat Management, Taoyuan, Taiwan
[2] Chang Gung Univ, Dept Digital Financial Technol, Taoyuan, Taiwan
[3] Chang Gung Univ, Dept Informat Management, Taoyuan, Taiwan
[4] Chang Gung Mem Hosp Linkou, Dept Thorac Surg, Taoyuan, Taiwan
Keywords
Feature selection; Ensemble feature selection; Machine learning; Class imbalance learning; Over-sampling; CLASSIFICATION; TRENDS; SMOTE
DOI
10.1016/j.asoc.2024.111267
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection aims to filter out unrepresentative features from a given dataset in order to construct more effective learning models. Ensemble feature selection, which combines multiple feature selection methods, has been shown to outperform single feature selection. However, the performance of different (ensemble) feature selection methods has not been fully examined on multi-class imbalanced datasets. On the other hand, for class-imbalanced datasets, one widely considered solution is to re-balance the data by over-sampling, which generates synthetic examples for the minority classes. The effect of performing (ensemble) feature selection on over-sampled multi-class imbalanced datasets has likewise not been investigated. Therefore, the first research objective is to examine the classification accuracy of single and ensemble feature selection methods built from fifteen well-known filter, wrapper, and embedded algorithms. For the second research objective, the two possible orderings of the feature selection and over-sampling steps are compared in order to identify the best combination procedure as well as the best combined algorithms. Experimental results on ten datasets from different domains, ranging from low to very high feature dimensionality, show that ensemble feature selection methods perform slightly better than single ones, although the differences are small. When combined with the Synthetic Minority Oversampling Technique (SMOTE), performing feature selection first and over-sampling second outperforms the reverse procedure. Although the best combined algorithms are based on ensemble feature selection, eXtreme Gradient Boosting (XGBoost), the single best feature selection algorithm, combined with SMOTE provides very similar classification performance to the best combined algorithms. Considering both classification performance and computational cost, the optimal solution is therefore the combination of XGBoost and SMOTE.
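The recommended procedure — feature selection first, over-sampling second — can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: the study evaluates fifteen established feature selection algorithms and the actual SMOTE method, whereas `select_top_k_features` (a toy filter criterion ranking features by the spread of per-class means) and `smote_like_oversample` (random same-class interpolation) are simplified stand-ins introduced here only to show the ordering of the two steps.

```python
import random
from collections import Counter

def select_top_k_features(X, y, k):
    """Toy filter-style selection: rank features by the spread of their
    per-class mean values and keep the indices of the top k features."""
    classes = sorted(set(y))
    n_feats = len(X[0])
    scores = []
    for j in range(n_feats):
        means = [sum(x[j] for x, lab in zip(X, y) if lab == c)
                 / sum(1 for lab in y if lab == c) for c in classes]
        scores.append(max(means) - min(means))
    keep = sorted(range(n_feats), key=lambda j: -scores[j])[:k]
    return sorted(keep)

def smote_like_oversample(X, y, rng=None):
    """SMOTE-style interpolation: for each minority class, synthesise new
    points on the segment between two random same-class samples until
    every class reaches the majority-class count."""
    rng = rng or random.Random(0)
    counts = Counter(y)
    target = max(counts.values())
    X_new, y_new = list(X), list(y)
    for c, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == c]
        for _ in range(target - n):
            a, b = rng.choice(pool), rng.choice(pool)
            t = rng.random()
            X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            y_new.append(c)
    return X_new, y_new

# Feature selection FIRST, over-sampling SECOND (the better-performing
# order reported in the abstract).
X = [[1.0, 5.0, 0.1], [1.2, 5.1, 0.2], [0.9, 4.9, 0.1],
     [3.0, 5.0, 0.9], [3.1, 5.2, 0.8]]
y = ["a", "a", "a", "b", "b"]

keep = select_top_k_features(X, y, k=2)   # [0, 2]: near-constant feature 1 is dropped
X_sel = [[x[j] for j in keep] for x in X]
X_bal, y_bal = smote_like_oversample(X_sel, y)
print(keep)
print(Counter(y_bal))                     # every class now has 3 samples
```

Doing the steps in this order means the synthetic minority examples are generated only in the reduced feature space, so the over-sampling step never interpolates along features that the selection step has already judged uninformative.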
Pages: 16