Correlation and variable importance in random forests

Cited by: 620
Authors
Gregorutti, Baptiste [1, 2]
Michel, Bertrand [2]
Saint-Pierre, Philippe [2]
Affiliations
[1] Safety Line, 15 Rue Jean Baptiste Berlier, F-75013 Paris, France
[2] Univ Paris 06, Lab Stat Theor & Appl, 4 Pl Jussieu, F-75252 Paris 05, France
Keywords
Random forests; Supervised learning; Variable importance; Variable selection; GENE SELECTION; CLASSIFICATION; STABILITY; FEATURES
DOI
10.1007/s11222-016-9646-1
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
This paper addresses variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging when predictors are highly correlated. First, we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates variables, using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables while achieving a good prediction error. Finally, this selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
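The elimination loop described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: a small k-nearest-neighbour regressor stands in for the random forest so that the example needs nothing beyond the Python standard library, importance is measured on a held-out validation set rather than on out-of-bag samples, and the synthetic data use independent (uncorrelated) predictors. All function names (`knn_predict`, `permutation_importance`, `rfe`, `make_data`) and the data-generating model are hypothetical.

```python
import random

def knn_predict(train_X, train_y, rows, cols, k=5):
    """Predict each row as the mean target of its k nearest training rows,
    measuring squared distance on the feature indices in `cols` only."""
    preds = []
    for r in rows:
        dists = sorted(
            (sum((r[c] - t[c]) ** 2 for c in cols), y)
            for t, y in zip(train_X, train_y)
        )
        preds.append(sum(y for _, y in dists[:k]) / k)
    return preds

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(train_X, train_y, val_X, val_y, cols, rng, repeats=5):
    """Mean increase in validation MSE when one column is shuffled."""
    base = mse(val_y, knn_predict(train_X, train_y, val_X, cols))
    scores = {}
    for c in cols:
        total = 0.0
        for _ in range(repeats):
            shuffled = [row[c] for row in val_X]
            rng.shuffle(shuffled)
            permuted = [row[:c] + [v] + row[c + 1:]
                        for row, v in zip(val_X, shuffled)]
            total += mse(val_y, knn_predict(train_X, train_y, permuted, cols)) - base
        scores[c] = total / repeats
    return scores

def rfe(train_X, train_y, val_X, val_y, rng):
    """Drop the least important column, recompute importances, repeat.
    Returns column indices in elimination order (last entry = last survivor)."""
    cols = list(range(len(train_X[0])))
    eliminated = []
    while len(cols) > 1:
        scores = permutation_importance(train_X, train_y, val_X, val_y, cols, rng)
        worst = min(cols, key=lambda c: scores[c])
        cols.remove(worst)
        eliminated.append(worst)
    return eliminated + cols

def make_data(n, rng):
    # Hypothetical additive model: columns 0 and 1 carry signal, 2-4 are noise.
    X = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(n)]
    y = [5 * r[0] + 2 * r[1] + rng.gauss(0, 0.1) for r in X]
    return X, y

rng = random.Random(0)
train_X, train_y = make_data(120, rng)
val_X, val_y = make_data(60, rng)
order = rfe(train_X, train_y, val_X, val_y, rng)
print("elimination order:", order)
```

Recomputing the importances after every elimination, rather than ranking once, is the point of the recursive scheme: the scores of the remaining variables can change once a variable has been dropped, which matters most when predictors are correlated.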
Pages: 659-678
Page count: 20
Related papers
50 in total
  • [21] Variable importance for sustaining macrophyte presence via random forests: data imputation and model settings
    Van Echelpoel, Wout
    Goethals, Peter L. M.
    SCIENTIFIC REPORTS, 2018, 8
  • [22] A Study of Strength and Correlation in Random Forests
    Bernard, Simon
    Heutte, Laurent
    Adam, Sebastien
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, 2010, 93 : 186 - 191
  • [23] Margin Based Permutation Variable Importance: a Stable Importance Measure for Random Forest
    Pei, Liu
    Lai, Yongxuan
    Piao, Peng
    Yang, Fan
    2017 12TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND KNOWLEDGE ENGINEERING (IEEE ISKE), 2017
  • [24] Improving land cover classification in an urbanized coastal area by random forests: The role of variable selection
    Zhang, Fang
    Yang, Xiaojun
    REMOTE SENSING OF ENVIRONMENT, 2020, 251
  • [25] Variable importance in binary regression trees and forests
    Ishwaran, Hemant
    ELECTRONIC JOURNAL OF STATISTICS, 2007, 1 : 519 - 537
  • [27] Towards a Better Understanding of Random Forests through the Study of Strength and Correlation
    Bernard, Simon
    Heutte, Laurent
    Adam, Sebastien
    EMERGING INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS: WITH ASPECTS OF ARTIFICIAL INTELLIGENCE, 2009, 5755 : 536 - 545
  • [28] Estimating neuronal variable importance with random forest
    Oh, J
    Laubach, M
    Luczak, A
    PROCEEDINGS OF THE IEEE 29TH ANNUAL NORTHEAST BIOENGINEERING CONFERENCE, 2003, : 33 - 34
  • [29] Bias in random forest variable importance measures: Illustrations, sources and a solution
    Carolin Strobl
    Anne-Laure Boulesteix
    Achim Zeileis
    Torsten Hothorn
    BMC Bioinformatics, 8
  • [30] Random forests for genomic data analysis
    Chen, Xi
    Ishwaran, Hemant
    GENOMICS, 2012, 99 (06) : 323 - 329