Approximation of SHAP Values for Randomized Tree Ensembles

Cited by: 7
Authors
Loecher, Markus [1 ]
Lai, Dingyi [2 ]
Qi, Wu [2 ]
Affiliations
[1] Berlin Sch Econ & Law, D-10825 Berlin, Germany
[2] Humboldt Univ, Dept Stat, Berlin, Germany
Source
MACHINE LEARNING AND KNOWLEDGE EXTRACTION, CD-MAKE 2022 | 2022, Vol. 13480
Keywords
SHAP values; Saabas value; Variable importance; Random forests; Boosting; Gini impurity
DOI
10.1007/978-3-031-14463-9_2
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Classification and regression trees offer straightforward methods of attributing importance values to input features, either globally or for a single prediction. Conditional feature contributions (CFCs) yield local, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, CFCs suffer from a potential bias which depends on the distance from the root of a tree. SHapley Additive exPlanation (SHAP) values, the by now immensely popular alternative, appear to mitigate this bias but are computationally much more expensive. Here we contribute a thorough, empirical comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests and boosted trees, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Unsurprisingly, these insights extend to the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
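The CFCs (Saabas values) compared in the abstract attribute, for each split on a sample's decision path, the change in the node's mean prediction to the split feature, so the contributions plus the root mean sum exactly to the model's output. A minimal sketch of that computation for a single scikit-learn regression tree (not the authors' code; the helper name `saabas_contributions` and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def saabas_contributions(tree, x):
    """Conditional feature contributions (Saabas values) for one sample:
    walk the decision path and credit each split feature with the change
    in the node's mean prediction. Returns (bias, per-feature vector)."""
    t = tree.tree_
    contrib = np.zeros(x.shape[0])
    node = 0
    bias = t.value[0, 0, 0]  # root mean = expected model output
    while t.children_left[node] != -1:  # -1 marks a leaf in sklearn trees
        f = t.feature[node]
        nxt = (t.children_left[node] if x[f] <= t.threshold[node]
               else t.children_right[node])
        contrib[f] += t.value[nxt, 0, 0] - t.value[node, 0, 0]
        node = nxt
    return bias, contrib

bias, contrib = saabas_contributions(model, X[0])
# additivity check: bias + contributions reproduce the tree's prediction
pred = model.predict(X[:1])[0]
assert np.isclose(bias + contrib.sum(), pred)
```

For an ensemble, the same walk is averaged (random forest) or summed (boosting) over the trees; the paper's point is that these cheap path-based scores track TreeSHAP values closely in practice.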
Pages: 19-30
Page count: 12
References
20 records
[1]   Random forests [J].
Breiman, L.
MACHINE LEARNING, 2001, 45 (01) :5-32
[3]  
Coleman T, 2019, Arxiv, DOI [arXiv:1904.07830, DOI 10.48550/ARXIV.1904.07830]
[4]  
Covert IC, 2020, ADV NEUR IN, V33
[5]   Gene selection and classification of microarray data using random forest [J].
Díaz-Uriarte, R;
de Andrés, SA.
BMC BIOINFORMATICS, 2006, 7, Art. no. 3
[6]  
Dua D, 2017, UCI machine learning repository
[7]   Variable Importance Assessment in Regression: Linear Regression versus Random Forest [J].
Groemping, Ulrike.
AMERICAN STATISTICIAN, 2009, 63 (04) :308-319
[8]   Classification trees with unbiased multiway splits [J].
Kim, H;
Loh, WY.
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2001, 96 (454) :589-604
[9]  
Loecher M., 2020, arXiv
[10]   Unbiased variable importance for random forests [J].
Loecher, Markus .
COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2022, 51 (05) :1413-1425