A computationally fast variable importance test for random forests for high-dimensional data

Cited by: 130
Authors
Janitza, Silke [1 ]
Celik, Ender [1 ]
Boulesteix, Anne-Laure [1 ]
Affiliation
[1] Univ Munich, Dept Med Informat Biometry & Epidemiol, Marchioninistr 15, D-81377 Munich, Germany
Keywords
Gene selection; Feature selection; Random forests; Variable importance; Variable selection; Variable importance test; REGRESSION TREES; CLASSIFICATION; PREDICTION; TUMOR; CANCER;
DOI
10.1007/s11634-016-0276-4
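Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification codes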
020208 ; 070103 ; 0714 ;
Abstract
Random forests are a commonly used tool for classification and for ranking candidate predictors based on the so-called variable importance measures. These measures attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, were developed for addressing this problem. The existing testing approaches require the repeated computation of random forests. While for low-dimensional settings those approaches might be computationally tractable, for high-dimensional settings typically including thousands of candidate predictors, computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies, which are based on real data from high-dimensional binary classification settings. The new approach controls the type I error and has at least comparable power at a substantially smaller computation time in the studies. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
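The abstract does not spell out how the test turns importance scores into p-values. A minimal sketch of one way to do it follows, assuming (this is not stated in the abstract) that the null distribution is built by mirroring the non-positive importance scores around zero, on the grounds that variables carrying no information should have scores scattered symmetrically around zero. The function name `mirrored_null_pvalues` and the input scores are hypothetical, and the scores are taken as given rather than computed from a forest.

```python
from bisect import bisect_left


def mirrored_null_pvalues(importances):
    """Return one p-value per importance score.

    Assumption (not from the abstract): the null distribution consists of
    the negative scores, the zero scores, and the negative scores reflected
    to the positive side.
    """
    negatives = [v for v in importances if v < 0]
    zeros = [v for v in importances if v == 0]
    # Sorted empirical null distribution, symmetric around zero.
    null = sorted(negatives + zeros + [-v for v in negatives])
    if not null:
        raise ValueError("no non-positive scores: null distribution is empty")
    n = len(null)
    # p-value = share of null values greater than or equal to the observed score.
    return [(n - bisect_left(null, v)) / n for v in importances]


# Hypothetical importance scores for five variables.
scores = [-0.2, -0.1, 0.0, 0.05, 0.5]
for score, p in zip(scores, mirrored_null_pvalues(scores)):
    print(f"importance={score:+.2f}  p-value={p:.2f}")
```

Only one set of importance scores is needed here, which is why such a test avoids the repeated forest fits that permutation-based tests require. With few non-positive scores the null distribution is coarse, so the sketch is only meaningful when many variables carry no information.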
Pages: 885 - 915
Number of pages: 31
Related papers
50 in total
  • [21] Bayesian weighted random forest for classification of high-dimensional genomics data
    Olaniran, Oyebayo Ridwan
    Abdullah, Mohd Asrul A.
    [J]. KUWAIT JOURNAL OF SCIENCE, 2023, 50 (04) : 477 - 484
  • [22] Ensemble of penalized logistic models for classification of high-dimensional data
    Ijaz, Musarrat
    Asghar, Zahid
    Gul, Asma
    [J]. COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2021, 50 (07) : 2072 - 2088
  • [23] ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R
    Wright, Marvin N.
    Ziegler, Andreas
    [J]. JOURNAL OF STATISTICAL SOFTWARE, 2017, 77 (01) : 1 - 17
  • [24] A new variable importance measure for random forests with missing data
    Hapfelmeier, Alexander
    Hothorn, Torsten
    Ulm, Kurt
    Strobl, Carolin
    [J]. STATISTICS AND COMPUTING, 2014, 24 (01) : 21 - 34
  • [26] Variable Importance in High-Dimensional Settings Requires Grouping
    Chamma, Ahmad
    Thirion, Bertrand
    Engemann, Denis
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 10, 2024 : 11195 - 11203
  • [27] The feature selection bias problem in relation to high-dimensional gene data
    Krawczuk, Jerzy
    Lukaszuk, Tomasz
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2016, 66 : 63 - 71
  • [29] An efficient random forests algorithm for high dimensional data classification
    Wang, Qiang
    Thanh-Tung Nguyen
    Huang, Joshua Z.
    Thuy Thi Nguyen
    [J]. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2018, 12 (04) : 953 - 972
  • [30] Development of biomarker classifiers from high-dimensional data
    Baek, Songjoon
    Tsai, Chen-An
    Chen, James J.
    [J]. BRIEFINGS IN BIOINFORMATICS, 2009, 10 (05) : 537 - 546