A framework for feature selection through boosting

Cited by: 89
Authors
Alsahaf, Ahmad [1 ]
Petkov, Nicolai [1 ]
Shenoy, Vikram [2 ]
Azzopardi, George [1 ]
Affiliations
[1] Univ Groningen, Bernoulli Inst Math Comp Sci & Artificial Intelli, POB 407, NL-9700 AK Groningen, Netherlands
[2] Northeastern Univ, Khoury Coll Comp Sci, West Village Residence Complex H, Boston, MA 02115 USA
Keywords
Feature selection; Boosting; Ensemble learning; XGBoost; Mutual information; Optimization
DOI
10.1016/j.eswa.2021.115895
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As the dimensionality of datasets used in predictive modelling continues to grow, feature selection becomes increasingly important. Datasets with complex feature interactions and high levels of redundancy still present a challenge to existing feature selection methods. We propose a novel framework for feature selection that relies on boosting, or sample re-weighting, to select sets of informative features in classification problems. The method builds on the feature rankings derived from fast and scalable tree-boosting models such as XGBoost. We compare the proposed method to standard feature selection algorithms on 9 benchmark datasets and show that it reaches higher accuracies with fewer features on most of the tested datasets, and that the selected features have lower redundancy.
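The abstract only outlines the approach at a high level. The short Python sketch below is an illustrative assumption rather than the authors' exact algorithm: it combines XGBoost's importance-based feature ranking with boosting-style sample re-weighting, up-weighting the samples that the currently selected feature subset still misclassifies so that later rounds favour complementary features. The dataset, the weight-doubling factor, and the number of selected features are placeholder choices.

```python
# Minimal illustrative sketch (assumed, not the paper's exact algorithm):
# boosting-style sample re-weighting combined with XGBoost feature rankings.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Placeholder data and parameters for illustration only.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)
n_select = 5                   # how many features to keep (assumed)
weights = np.ones(len(y))      # start with uniform sample weights
selected = []

for _ in range(n_select):
    # Fit a tree-boosting model on the re-weighted samples and rank features
    # by its importance scores.
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y, sample_weight=weights)
    ranking = np.argsort(model.feature_importances_)[::-1]

    # Greedily add the best not-yet-selected feature.
    best = next(int(f) for f in ranking if int(f) not in selected)
    selected.append(best)

    # Re-weighting step: up-weight samples that a model restricted to the
    # current subset still misclassifies, so the next round favours features
    # that explain the remaining errors.
    subset_model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
    subset_model.fit(X[:, selected], y)
    misclassified = subset_model.predict(X[:, selected]) != y
    weights = np.where(misclassified, weights * 2.0, weights)
    weights *= len(weights) / weights.sum()   # keep the mean weight at 1

print("Selected feature indices:", selected)
```

The re-weighting rule and the stopping criterion are the main design choices in such a scheme; the paper itself should be consulted for the exact update and termination condition it uses.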
Pages: 10