Identifying Informative Predictor Variables With Random Forests

Cited by: 11
Authors
Rothacher, Yannick [1]
Strobl, Carolin [1]
Affiliations
[1] Univ Zurich, Psychol Methods Evaluat & Stat, Binzmuehlestr 14, Box 27, CH-8050 Zurich, Switzerland
Keywords
random forest; variable importance; interpretable machine learning; recursive partitioning; variable selection; selection methods; classification; trees
DOI
10.3102/10769986231193327
Chinese Library Classification
G40 [Education]
Subject classification codes
040101; 120403
Abstract
Random forests are a nonparametric machine learning method, which is currently gaining popularity in the behavioral sciences. Despite random forests' potential advantages over more conventional statistical methods, a remaining question is how reliably informative predictor variables can be identified by means of random forests. The present study aims at giving a comprehensible introduction to the topic of variable selection with random forests and providing an overview of the currently proposed selection methods. Using simulation studies, the variable selection methods are examined regarding their statistical properties, and comparisons between their performances and the performance of a conventional linear model are drawn. Advantages and disadvantages of the examined methods are discussed, and practical recommendations for the use of random forests for variable selection are given.
Pages: 595-629
Page count: 35