Analysis and comparison of feature selection methods towards performance and stability

Cited by: 30
Authors
Barbieri, Matheus Cezimbra [1]
Grisci, Bruno Iochins [1]
Dorn, Marcio [1,2,3]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, BR-90040060 Porto Alegre, RS, Brazil
[2] Univ Fed Rio Grande do Sul, Ctr Biotechnol, BR-90040060 Porto Alegre, RS, Brazil
[3] Natl Inst Forens Sci & Technol, Porto Alegre, RS, Brazil
Keywords
Feature selection; Dimensionality reduction; Machine learning; Classification; Stability; Tabular data; CLASSIFICATION; AGGREGATION; ALGORITHMS; RELEVANCE;
DOI
10.1016/j.eswa.2024.123667
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The amount of data gathered for machine learning applications such as natural language processing, computer vision, and bioinformatics is increasing at unprecedented rates. This growth implies a higher number of samples and features, which raises problems associated with high-dimensional data: the curse of dimensionality, small sample sizes, noisy or redundant features, and biased data. Feature selection is fundamental to dealing with such problems. It reduces data dimensionality by selecting the most relevant and least redundant features, thereby reducing computational cost, improving accuracy, and making the data more interpretable to machine learning models and domain experts. However, there are many selection algorithms from which to choose. This work compares some of the most representative algorithms from different feature selection groups across a broad range of measures, several datasets, and different selection strategies. We employ metrics that appraise the selection accuracy, selection redundancy, prediction performance, algorithmic stability, selection reliability, and computational time of several feature selection algorithms. We also developed and shared a new open Python framework to benchmark the algorithms. The results highlight the strengths and weaknesses of these algorithms and can guide their application.
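As an illustration of how the algorithmic stability mentioned in the abstract can be quantified, the sketch below estimates the stability of a filter-style selector as the mean pairwise Jaccard similarity of the feature subsets chosen on bootstrap resamples. This is a minimal sketch, not the authors' benchmarking framework or their exact metrics: the scikit-learn SelectKBest/f_classif selector, the synthetic dataset, and the Jaccard-based stability measure are assumptions chosen for illustration.

```python
"""Illustrative sketch (not the paper's framework): estimate selection
stability as the mean pairwise Jaccard similarity of feature subsets
selected on bootstrap resamples of the data."""
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif


def jaccard(a, b):
    """Jaccard similarity between two sets of selected feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Synthetic high-dimensional tabular data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=20, random_state=0)

rng = np.random.default_rng(0)
subsets = []
for _ in range(10):                                   # 10 bootstrap resamples
    idx = rng.choice(len(y), size=len(y), replace=True)
    selector = SelectKBest(f_classif, k=20).fit(X[idx], y[idx])
    subsets.append(np.flatnonzero(selector.get_support()))

# Mean pairwise Jaccard similarity: 1.0 means a perfectly stable selection.
stability = np.mean([jaccard(a, b) for a, b in combinations(subsets, 2)])
print(f"Estimated selection stability: {stability:.3f}")
```

In a benchmark setting, the same loop would be repeated for each selector and dataset so that stability can be compared alongside prediction performance and runtime.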
Pages: 32