Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

Cited by: 0
Authors
Brophy, Jonathan [1 ]
Hammoudeh, Zayd [1 ]
Lowd, Daniel [1 ]
Affiliations
[1] Univ Oregon, Dept Comp & Informat Sci, Eugene, OR 97403 USA
Keywords
training data influence; gradient-boosted decision trees; instance attribution; influence functions; TracIn; PERFORMANCE; SELECTION;
DOI
Not available
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812
Abstract
Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they're trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely used class of models; however, they are black boxes with opaque decision-making processes. In pursuit of better understanding GBDT predictions and of generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available at https://github.com/jjbrophy47/tree_influence. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs equally well or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.
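The abstract treats leave-one-out (LOO) retraining as the gold-standard influence measure: remove one training example, retrain, and see how the loss on a target prediction changes. A minimal sketch of that baseline using scikit-learn's GradientBoostingClassifier on synthetic data (the function name `loo_influence` and the toy dataset are illustrative assumptions, not the paper's `tree_influence` implementation, which covers multiple GBDT libraries and far faster estimators such as BoostIn):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def loo_influence(X_train, y_train, x_test, y_test, n_trees=20, seed=0):
    """LOO influence of each training example on one test example's loss.

    Positive influence: removing the example raises the test loss
    (it was helping this prediction); negative: removing it lowers the loss.
    """
    def test_loss(X, y):
        model = GradientBoostingClassifier(n_estimators=n_trees,
                                           random_state=seed)
        model.fit(X, y)
        # Negative log-likelihood of the true test label.
        p = model.predict_proba(x_test.reshape(1, -1))[0, y_test]
        return -np.log(max(p, 1e-12))

    base = test_loss(X_train, y_train)
    influences = np.empty(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i  # drop example i, retrain
        influences[i] = test_loss(X_train[mask], y_train[mask]) - base
    return influences

# Toy setup: hold out the last example as the target prediction.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, y_train, x_test, y_test = X[:-1], y[:-1], X[-1], y[-1]
inf = loo_influence(X_train, y_train, x_test, y_test)
print("most influential training index:", int(np.argmax(inf)))
```

This brute-force loop retrains the model once per training example, which is exactly why the paper's evaluation emphasizes estimators that are orders of magnitude faster than LOO.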
Pages: 48