An Evaluation of Feature Selection Robustness on Class Noisy Data

Cited by: 1
Authors
Pau, Simone [1 ]
Perniciano, Alessandra [1 ]
Pes, Barbara [1 ]
Rubattu, Dario [1 ]
Affiliation
[1] Univ Cagliari, Dept Math & Comp Sci, Via Osped 72, I-09124 Cagliari, Italy
Keywords
feature selection; high-dimensional and imbalanced data; noisy data; robustness to noise; GENE-EXPRESSION; LABEL NOISE; CLASSIFICATION; CANCER; SIMILARITY; PREDICTION;
DOI
10.3390/info14080438
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
With the continuing growth of data dimensionality, feature selection has become a crucial step in a variety of machine learning and data mining applications. It allows the most important attributes of the task at hand to be identified, improving the efficiency, interpretability, and final performance of the induced models. Recent literature includes several studies that examine the strengths and weaknesses of the available feature selection methods from different points of view, yet little work has investigated how sensitive these methods are to the presence of noisy instances in the input data. This is the specific gap our work aims to address. Since noise is arguably inevitable in many application scenarios, it is important to understand the extent to which the different selection heuristics are affected by it, in particular by class noise, which is the more harmful kind in supervised learning tasks. Such an evaluation is especially relevant for class-imbalanced problems, where any perturbation of the training records can strongly affect the final selection outcome. In this regard, we provide a two-fold contribution: (i) a general methodology to evaluate feature selection robustness on class noisy data and (ii) an experimental study involving different selection methods, both univariate and multivariate. The experiments were conducted on eight high-dimensional datasets chosen to be representative of different real-world domains, and they offer interesting insights into the intrinsic degree of robustness of the considered selection approaches.
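To make the kind of evaluation described in the abstract concrete, the sketch below injects class noise by flipping a fraction of the training labels, re-runs a univariate filter, and quantifies robustness as the Jaccard similarity between the top-k feature sets selected from the clean and the noisy data. This is a minimal illustration under stated assumptions, not the paper's exact protocol: the ANOVA F-score filter, the random label-flipping scheme, the Jaccard measure, and the synthetic imbalanced dataset are all choices made only for the example.

```python
# Minimal sketch (not the paper's exact protocol): measure how much a
# univariate filter's top-k feature set changes when class labels are noisy.
# Assumptions: ANOVA F-score ranking, random label flipping as class noise,
# Jaccard similarity between top-k sets as the robustness measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def top_k_features(X, y, k):
    """Rank features with the ANOVA F-score and return the indices of the top k."""
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])


def flip_labels(y, noise_rate, rng):
    """Inject class noise by flipping the labels of a random fraction of instances."""
    y_noisy = y.copy()
    n_flip = int(noise_rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]  # binary labels assumed
    return y_noisy


def jaccard(a, b):
    """Similarity between two feature subsets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


rng = np.random.default_rng(0)
# Synthetic high-dimensional, imbalanced binary dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=2000, n_informative=50,
                           weights=[0.9, 0.1], random_state=0)

k = 50
clean_set = top_k_features(X, y, k)
for noise_rate in (0.05, 0.10, 0.20):
    sims = [jaccard(clean_set, top_k_features(X, flip_labels(y, noise_rate, rng), k))
            for _ in range(10)]  # average over repeated noise injections
    print(f"noise {noise_rate:.0%}: mean Jaccard similarity = {np.mean(sims):.3f}")
```

In a full study along the lines the abstract describes, one would repeat this over several noise levels, multiple selection methods (univariate and multivariate), and several real high-dimensional datasets, averaging the subset similarity over many independent noise injections.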
Pages: 19
References (52 in total)
[1] Abu Shanab, A.; Khoshgoftaar, T.M. Filter-Based Subset Selection for Easy, Moderate, and Hard Bioinformatics Data. 2018 IEEE International Conference on Information Reuse and Integration (IRI), 2018, pp. 372-377.
[2] Almugren, N.; Alshamlan, H. A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification. IEEE Access, 2019, 7, pp. 78533-78548.
[3] Altidor, W. 2011 IEEE International Conference on Information Reuse and Integration (IRI), 2011, p. 240. DOI: 10.1109/IRI.2011.6009553.
[4] Anyfantis, D. International Federation for Information Processing (IFIP), 2007, p. 21.
[5] Arun, S. Machine Learning Research, 2019. DOI: 10.13140/RG.2.2.25669.91369.
[6] Beer, D.G.; Kardia, S.L.R.; Huang, C.C.; Giordano, T.J.; Levin, A.M.; Misek, D.E.; Lin, L.; Chen, G.A.; Gharib, T.G.; Thomas, D.G.; Lizyness, M.L.; Kuick, R.; Hayasaka, S.; Taylor, J.M.G.; Iannettoni, M.D.; Orringer, M.B.; Hanash, S. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 2002, 8(8), pp. 816-824.
[7] Bolón-Canedo, V. Advances in Selected Artificial Intelligence Areas: World Outstanding Women in Artificial Intelligence, 2022, p. 11.
[8] Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. Feature selection for high-dimensional data. Progress in Artificial Intelligence, 2016, 5(2), pp. 65-75.
[9] Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowledge and Information Systems, 2013, 34(3), pp. 483-519.
[10] Breiman, L. Random forests. Machine Learning, 2001, 45(1), pp. 5-32.