Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

Citations: 0
Authors
Brophy, Jonathan [1]
Hammoudeh, Zayd [1]
Lowd, Daniel [1]
Affiliations
[1] Univ Oregon, Dept Comp & Informat Sci, Eugene, OR 97403 USA
Keywords
training data influence; gradient-boosted decision trees; instance attribution; influence functions; TracIn; PERFORMANCE; SELECTION
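DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract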
Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they're trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely used class of models; however, these models are black boxes with opaque decision-making processes. In the pursuit of better understanding GBDT predictions and generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available at https://github.com/jjbrophy47/tree_influence. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs as well as or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.
Pages: 48