Correlation and variable importance in random forests

Cited by: 619
Authors
Gregorutti, Baptiste [1, 2]
Michel, Bertrand [2 ]
Saint-Pierre, Philippe [2 ]
Affiliations
[1] Safety Line, 15 Rue Jean Baptiste Berlier, F-75013 Paris, France
[2] Univ Paris 06, Lab Stat Theor & Appl, 4 Pl Jussieu, F-75252 Paris 05, France
Keywords
Random forests; Supervised learning; Variable importance; Variable selection; GENE SELECTION; CLASSIFICATION; STABILITY; FEATURES;
DOI
10.1007/s11222-016-9646-1
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
This paper addresses variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging when predictors are highly correlated. First, we provide a theoretical study of the permutation importance measure for an additive regression model, which allows us to describe how the correlation between predictors affects the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates variables, using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm in selecting a small number of variables while achieving a good prediction error. Finally, the selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
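The elimination loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes scikit-learn and NumPy are available, it computes permutation importance on a held-out validation split rather than on out-of-bag samples, and the function name `rfe_permutation`, the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def rfe_permutation(X, y, n_trees=50, seed=0):
    """Return feature indices in elimination order (least important first).

    Hypothetical helper: refits a forest on the remaining features at each
    step and drops the one with the smallest permutation importance.
    """
    # Assumption: a validation split stands in for out-of-bag samples.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X_tr[:, remaining], y_tr)
        result = permutation_importance(
            forest, X_val[:, remaining], y_val, n_repeats=5, random_state=seed
        )
        # Position (within `remaining`) of the least important feature.
        worst = int(np.argmin(result.importances_mean))
        eliminated.append(remaining.pop(worst))
    eliminated.append(remaining[0])  # the last surviving feature
    return eliminated


# Synthetic example: x2 is strongly correlated with x0; columns 3-5 are noise.
rng = np.random.default_rng(0)
n = 300
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 0.9 * x0 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
noise = rng.normal(size=(n, 3))
X = np.column_stack([x0, x1, x2, noise])
y = 3 * x0 + 2 * x1 + 0.1 * rng.normal(size=n)

order = rfe_permutation(X, y)
print("elimination order (least important first):", order)
```

Recomputing the importances after every removal, instead of ranking once, is what makes the procedure recursive: the importance of a variable can change once a correlated companion has been dropped.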
Pages: 659-678
Number of pages: 20