Selecting Informative Contexts Improves Language Model Fine-tuning

Cited by: 0
Authors
Antonello, Richard [1 ]
Beckage, Nicole M. [2 ]
Turek, Javier S. [2 ]
Huth, Alexander G. [1 ]
Affiliations
[1] UT Austin, Austin, TX 78712 USA
[2] Intel Labs, Hillsboro, OR USA
Source
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1 (ACL-IJCNLP 2021) | 2021
Keywords: Not provided
DOI: Not available
Chinese Library Classification: TP18 [Artificial intelligence theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Language model fine-tuning is essential for modern natural language processing, but is computationally expensive and time-consuming. Further, the effectiveness of fine-tuning is limited by the inclusion of training examples that negatively affect performance. Here we present a general fine-tuning method that we call information gain filtration for improving the overall training efficiency and final performance of language model fine-tuning. We define the information gain of an example as the improvement on a validation metric after training on that example. A secondary learner is then trained to approximate this quantity. During fine-tuning, this learner selects informative examples and skips uninformative ones. We show that our method yields consistent improvements across datasets, fine-tuning tasks, and language model architectures. For example, we achieve a median perplexity of 54.0 on a books dataset, compared to 57.3 for standard fine-tuning. We present statistical evidence that offers insight into the improvements of our method over standard fine-tuning. The generality of our method leads us to propose a new paradigm for language model fine-tuning: we encourage researchers to release pretrained secondary learners on common corpora to promote efficient and effective fine-tuning, thereby improving the performance and reducing the overall energy footprint of language model fine-tuning.
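To make the procedure described in the abstract concrete, below is a minimal PyTorch sketch, assuming Hugging Face-style causal language models whose forward pass returns a loss when labels are supplied. The function names (information_gain, fine_tune_with_filtering), the secondary_learner interface (a small model mapping a tokenized example to a scalar predicted gain), and the threshold value are illustrative assumptions, not the authors' released implementation.

import copy
import torch

def information_gain(model, example, val_batch, lr=5e-5):
    """Estimate the information gain of `example`: the drop in validation loss
    after one gradient step on that example (larger = more informative)."""
    probe = copy.deepcopy(model)                      # leave the real model untouched
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)
    probe.train()
    loss = probe(**example, labels=example["input_ids"]).loss
    loss.backward()
    optimizer.step()
    probe.eval()
    with torch.no_grad():
        loss_before = model(**val_batch, labels=val_batch["input_ids"]).loss
        loss_after = probe(**val_batch, labels=val_batch["input_ids"]).loss
    return (loss_before - loss_after).item()          # positive => validation improved

def fine_tune_with_filtering(model, optimizer, train_examples,
                             secondary_learner, threshold=0.0):
    """Standard fine-tuning loop, except that examples whose predicted
    information gain falls below `threshold` are skipped rather than trained on."""
    model.train()
    for example in train_examples:
        with torch.no_grad():
            # Assumed interface: the secondary learner maps token IDs to a
            # scalar estimate of the example's information gain.
            predicted_gain = secondary_learner(example["input_ids"])
        if predicted_gain.item() < threshold:
            continue                                  # skip an uninformative context
        loss = model(**example, labels=example["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice the secondary learner would first be fit by regression on (example, measured information gain) pairs collected with information_gain, so that the expensive probe step is only paid once per corpus rather than on every fine-tuning run.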
Pages: 1072-1085 (14 pages)