Consistent and unbiased variable selection under independent features using Random Forest permutation importance

Cited by: 6
Authors
Ramosaj, Burim [1 ]
Pauly, Markus [1 ]
Affiliations
[1] Tech Univ Dortmund, Inst Math Stat & Applicat Ind, Fac Stat, Joseph Von Fraunhofer Str 2-4, D-44227 Dortmund, Germany
Keywords
Random Forest; permutation importance; unbiasedness; consistency; Out-of-Bag samples; statistical learning
DOI
10.3150/22-BEJ1534
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline Classification Codes
020208; 070103; 0714
Abstract
Variable selection in sparse regression models is an important task, as applications ranging from biomedical research to econometrics have shown. Especially for higher-dimensional regression problems, in which the regression function linking response and covariates cannot be detected directly, the selection of informative variables is challenging. Under these circumstances, the Random Forest method is a helpful tool for predicting new outcomes while delivering measures for variable selection. One common approach is the use of the permutation importance. Due to its intuitive idea and flexible usage, it is important to explore the circumstances under which the permutation importance based on Random Forests correctly indicates informative covariates. Regarding the latter, we deliver theoretical guarantees for the validity of the permutation importance measure under specific assumptions, such as the mutual independence of the features, and prove its (asymptotic) unbiasedness; under slightly stricter assumptions, consistency of the permutation importance measure is established. An extensive simulation study supports our findings.
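A minimal sketch of the variable-selection setting described in the abstract, assuming mutually independent features and a sparse regression function. It uses scikit-learn's RandomForestRegressor with permutation_importance computed on held-out data as a stand-in for the Out-of-Bag-based permutation importance analysed in the paper; the number of informative covariates and all coefficients below are illustrative assumptions, not taken from the paper.

```python
# Sketch: Random Forest permutation importance for variable selection
# under mutually independent features and a sparse regression function.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 1000, 10                          # sample size and number of covariates (assumed)
X = rng.normal(size=(n, p))              # mutually independent features
# Sparse regression function: only the first three covariates are informative (assumed).
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance: mean drop in predictive accuracy after permuting each covariate.
# Informative covariates should receive clearly positive scores, uninformative ones scores near zero.
result = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"X_{j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")
```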
Pages: 2101-2118
Page count: 18