Identifying Informative Predictor Variables With Random Forests

Cited by: 15
Authors
Rothacher, Yannick [1]
Strobl, Carolin [1]
Affiliations
[1] Univ Zurich, Psychol Methods Evaluat & Stat, Binzmuehlestr 14, Box 27, CH-8050 Zurich, Switzerland
Keywords
random forest; variable importance; interpretable machine learning; recursive partitioning; variable selection; COGNITIVE DIAGNOSIS; PROGRESSIVE MATRICES; SELECTION METHODS; RESPONSE-TIMES; MODEL; CLASSIFICATION; STRATEGIES; CHOICE; PERFORMANCE; FRAMEWORK;
DOI
10.3102/10769986231193327
Chinese Library Classification (CLC) number
G40 [Education];
Discipline classification codes
040101 ; 120403 ;
Abstract
Random forests are a nonparametric machine learning method, which is currently gaining popularity in the behavioral sciences. Despite random forests' potential advantages over more conventional statistical methods, a remaining question is how reliably informative predictor variables can be identified by means of random forests. The present study aims at giving a comprehensible introduction to the topic of variable selection with random forests and providing an overview of the currently proposed selection methods. Using simulation studies, the variable selection methods are examined regarding their statistical properties, and comparisons between their performances and the performance of a conventional linear model are drawn. Advantages and disadvantages of the examined methods are discussed, and practical recommendations for the use of random forests for variable selection are given.
Pages: 595-629
Page count: 35