A computationally fast variable importance test for random forests for high-dimensional data

Cited by: 130
Authors
Janitza, Silke [1 ]
Celik, Ender [1 ]
Boulesteix, Anne-Laure [1 ]
Affiliations
[1] Univ Munich, Dept Med Informat Biometry & Epidemiol, Marchioninistr 15, D-81377 Munich, Germany
Keywords
Gene selection; Feature selection; Random forests; Variable importance; Variable selection; Variable importance test; REGRESSION TREES; CLASSIFICATION; PREDICTION; TUMOR; CANCER;
DOI
10.1007/s11634-016-0276-4
Chinese Library Classification (CLC) codes
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Random forests are a commonly used tool for classification and for ranking candidate predictors based on so-called variable importance measures. These measures assign scores to the variables that reflect their importance. A drawback of variable importance measures is that there is no natural cutoff for discriminating between important and unimportant variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches require the repeated computation of random forests. While these approaches may be computationally tractable in low-dimensional settings, the computing time becomes enormous in high-dimensional settings, which typically include thousands of candidate predictors. In this article, a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data in which many variables carry no information. The testing approach is based on a modified version of the permutation variable importance measure that is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues in simulation studies based on real data from high-dimensional binary classification settings. In these studies, the new approach controls the type I error and has at least comparable power at a substantially smaller computation time. It might therefore be used as a computationally fast alternative to existing procedures in high-dimensional data settings where many variables carry no information. The new approach is implemented in the R package vita.
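The sketch below illustrates, in R, how the test described in the abstract might be applied to simulated high-dimensional data via the vita package. It is a minimal example, not an excerpt from the paper; the function names (CVPVI, NTA), their arguments (k, ntree, pless), and the cv_varim component follow my reading of the package documentation and should be verified against ?CVPVI and ?NTA.

## Minimal usage sketch (assumptions flagged above): heuristic variable
## importance test on simulated data with many uninformative predictors.
library(vita)

set.seed(1)
n <- 100; p <- 500                                   # high-dimensional toy setting
X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("V", seq_len(p))
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "case", "control"))

## Cross-validation-like permutation variable importance (the "modified"
## importance from the abstract): forests are grown on k - 1 folds and the
## permutation importance is evaluated on the held-out fold.
cv_vi <- CVPVI(X, y, k = 2, ntree = 500)

## Heuristic importance test: importance scores of apparently uninformative
## variables are used to approximate the null distribution and derive p-values.
test <- NTA(cv_vi$cv_varim)
summary(test, pless = 0.1)                            # variables with small p-values

In this setting only V1 and V2 carry signal, so their p-values should be small while the remaining variables behave like draws from the null distribution.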
Published in: Advances in Data Analysis and Classification, 2018, Vol. 12
Pages: 885-915
Number of pages: 31