Valuing Training Data via Causal Inference for In-Context Learning

Cited by: 1
Authors
Zhou, Xiaoling [1]
Ye, Wei [1]
Lee, Zhemg [2]
Zou, Lei [3]
Zhang, Shikun [1]
Affiliations
[1] Peking University, National Engineering Research Center for Software Engineering, Beijing 100871, China
[2] Tianjin University, Tianjin 300072, China
[3] Peking University, Wangxuan Institute of Computer Technology, Beijing 100871, China
Keywords
Training; Training data; Cost accounting; Robustness; Data models; Reviews; Linear regression; Semantics; Electronic mail; Computational efficiency; In-context learning; data valuation; causal inference; average marginal effect; elastic net regression; regularization
DOI
10.1109/TKDE.2025.3546761
CLC classification code
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In-context learning (ICL) enables large pre-trained language models (PLMs) to predict outcomes for unseen inputs without parameter updates. However, the efficacy of ICL depends heavily on the choice of demonstration examples, and selecting them randomly from the training set frequently leads to inconsistent performance. To address this challenge, this study takes a novel approach centered on training data valuation via causal inference. Specifically, we introduce the concept of the average marginal effect (AME) to quantify the contribution of individual training samples to ICL performance, encompassing both generalization and robustness. Drawing inspiration from multiple treatment effects and randomized experiments, we first sample diverse training subsets to construct prompts and evaluate ICL performance with these prompts. We then employ Elastic Net regression to jointly estimate the AME values of all training data from the subset compositions and the corresponding inference performance. Finally, we prioritize the samples with the highest values when prompting inference on test data. Across various tasks and seven PLMs ranging in size from 0.8B to 33B parameters, our approach consistently achieves state-of-the-art performance. In particular, it outperforms Vanilla ICL and the best-performing baseline by an average of 14.1% and 5.2%, respectively. Moreover, prioritizing the most valuable samples for prompting significantly enhances performance stability and robustness across diverse learning scenarios. Notably, the valuable samples transfer across diverse PLMs and generalize well to out-of-distribution tasks.
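To make the valuation recipe in the abstract concrete, below is a minimal Python sketch (not the authors' implementation) of AME-style data valuation for ICL: sample random training subsets, score the prompts they induce, and regress the scores on subset membership with Elastic Net so the coefficients approximate each sample's average marginal effect. The function evaluate_icl, the subset size, and the regularization settings are illustrative assumptions; only scikit-learn's ElasticNet is a real API.

# Minimal sketch of AME-style data valuation for in-context learning.
# Assumption: a user-supplied evaluate_icl(subset_indices) builds a prompt
# from the chosen training examples and returns a validation score.
import numpy as np
from sklearn.linear_model import ElasticNet

def estimate_ame_values(n_train, evaluate_icl, n_subsets=500,
                        subset_size=8, alpha=0.01, l1_ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = np.zeros((n_subsets, n_train))   # binary subset-membership design matrix
    y = np.zeros(n_subsets)              # ICL performance per sampled prompt
    for j in range(n_subsets):
        subset = rng.choice(n_train, size=subset_size, replace=False)
        X[j, subset] = 1.0
        y[j] = evaluate_icl(subset)      # e.g., accuracy on a held-out validation split
    # Elastic Net regression of performance on membership indicators:
    # each coefficient approximates that sample's marginal contribution (AME).
    reg = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=True)
    reg.fit(X, y)
    return reg.coef_

def select_demonstrations(ame_values, k=8):
    # Prioritize the k training samples with the highest estimated value.
    return np.argsort(ame_values)[::-1][:k]

The design choice mirrored here is that regressing observed prompt performance on subset composition amortizes many randomized "treatments" into a single joint estimate, so every training sample receives a value without exhaustively enumerating subsets.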
Pages: 3824-3840
Page count: 17