Genetic Programming for Feature Selection Based on Feature Removal Impact in High-Dimensional Symbolic Regression

Cited by: 5
Authors
Al-Helali, Baligh [1 ,2 ]
Chen, Qi [1 ,2 ]
Xue, Bing [1 ,2 ]
Zhang, Mengjie [1 ,2 ]
Affiliations
[1] Victoria Univ Wellington, Ctr Data Sci & Artificial Intelligence, Wellington 6140, New Zealand
[2] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington 6140, New Zealand
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2024, Vol. 8, No. 3
Keywords
Feature selection; genetic programming; high dimensionality; symbolic regression; FEATURE RANKING; CLASSIFICATION; EVOLUTIONARY
DOI
10.1109/TETCI.2024.3369407
Chinese Library Classification (CLC): TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract
Symbolic regression (SR) is increasingly important for discovering mathematical models for various prediction tasks. It works by searching for the arithmetic expressions that best represent a target variable using a set of input features. However, as the number of features increases, the search process becomes more complex. To address high-dimensional symbolic regression, this work proposes a genetic programming-based feature selection method built on the impact that removing a feature has on the performance of SR models. Unlike existing Shapley value methods that simulate feature absence at the data level, the proposed approach removes features at the model level. This circumvents the production of unrealistic data instances, which is a major limitation of Shapley value and permutation-based methods. Moreover, after computing the feature importances, a cut-off strategy is proposed for selecting the important features: a number of random features are injected and their importances are used to set a threshold automatically. The experimental results on artificial and real-world high-dimensional data sets show that, compared with state-of-the-art feature selection methods based on permutation importance and the Shapley value, the proposed method not only improves SR accuracy but also selects smaller sets of features.
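The following is a minimal illustrative sketch (not the authors' implementation) of the two ideas described in the abstract, under stated assumptions: feature importance is taken as the increase in loss after removing a feature at the model level, which is assumed here to mean replacing the feature's nodes in a toy expression tree with a zero constant, and the cut-off threshold is assumed to be the largest importance obtained by injected random features. The expression encoding, the removal rule, and the threshold rule are assumptions made for illustration only.

import numpy as np

def eval_expr(expr, X):
    # Evaluate a tiny expression tree: ('add'|'mul', a, b), ('var', j), or ('const', c).
    op = expr[0]
    if op == 'var':
        return X[:, expr[1]]
    if op == 'const':
        return np.full(X.shape[0], expr[1])
    a, b = eval_expr(expr[1], X), eval_expr(expr[2], X)
    return a + b if op == 'add' else a * b

def remove_feature(expr, j):
    # Model-level removal (assumed rule): replace every node reading feature j with a zero constant.
    op = expr[0]
    if op == 'var':
        return ('const', 0.0) if expr[1] == j else expr
    if op == 'const':
        return expr
    return (op, remove_feature(expr[1], j), remove_feature(expr[2], j))

def removal_importance(expr, X, y):
    # Importance of feature j = increase in mean squared error after removing j from the model.
    base = np.mean((eval_expr(expr, X) - y) ** 2)
    return np.array([np.mean((eval_expr(remove_feature(expr, j), X) - y) ** 2) - base
                     for j in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

# Inject random (irrelevant) features; their importances define the cut-off (assumed: their maximum).
Xa = np.hstack([X, rng.normal(size=(200, 2))])

# A hand-written stand-in for an evolved SR model over feature indices 0..4.
model = ('add', ('mul', ('const', 2.0), ('var', 0)), ('mul', ('var', 1), ('var', 2)))

imp = removal_importance(model, Xa, y)
threshold = imp[X.shape[1]:].max()
selected = [j for j in range(X.shape[1]) if imp[j] > threshold]
print("importances:", np.round(imp, 3), "selected:", selected)

In practice the threshold would be computed from importances of random features that the evolved models may actually use; here the toy model ignores them, so their importance is zero and only the genuinely informative features are selected.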
Pages: 2269-2282
Number of pages: 14