AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Cited by: 1
Authors
Zhou, Qinhong [1 ]
Li, Peng [2 ]
Liu, Yang [2 ]
Guan, Yuyang [3 ]
Xing, Qizhou [3 ]
Chen, Ming [3 ]
Sun, Maosong [1 ]
Liu, Yang [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Tsinghua Univ, Inst AI Ind Res, Beijing, Peoples R China
[3] Beijing Sinovoice Technol Co Ltd, Beijing, Peoples R China
Source
AI OPEN | 2023, Vol. 4
Keywords
Knowledge distillation; Pre-trained language model; Active learning;
DOI
10.1016/j.aiopen.2023.08.005
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing the KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically apply a single static data selection strategy throughout the KD process, yielding inconsistent improvements across distillation scenarios. In this work, we conduct a thorough study of typical data selection strategies for KD and show that this inconsistency arises because the best data selection strategy depends on several factors, including the task, the selected data size, and the training stage. To adapt to these factors automatically, we propose AdaDS, a framework that learns to choose the data selection strategy adaptively during the KD process. Experimental results show that our method is effective across tasks and selected data sizes in both the fine-tuning and pre-training stages, achieving performance comparable to DistilBERT with only 10% of the queries to the teacher model.
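The abstract describes computing the KD loss on a selected subset of the training data and adapting the choice of selection strategy during distillation. Below is a minimal illustrative sketch of that idea in PyTorch, assuming the adaptive choice is modeled as an epsilon-greedy selection among candidate strategies, rewarded by the drop in KD loss after each update; the strategy set, reward definition, toy models, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: KD on a selected subset with adaptive strategy choice.
# Not the AdaDS implementation; names and the bandit-style chooser are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KD loss: KL divergence between temperature-scaled distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def select_random(student, x, k):
    """Candidate strategy 1: uniform random subset."""
    return torch.randperm(x.size(0))[:k]

def select_uncertain(student, x, k):
    """Candidate strategy 2: examples where the student is most uncertain (highest entropy)."""
    with torch.no_grad():
        p = F.softmax(student(x), dim=-1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(-1)
    return entropy.topk(k).indices

strategies = [select_random, select_uncertain]
values = torch.zeros(len(strategies))   # running reward estimate per strategy
counts = torch.zeros(len(strategies))
eps = 0.1                               # epsilon-greedy exploration rate

# Toy teacher/student stand-ins for large and small PLMs.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

data = torch.randn(512, 16)             # toy unlabeled pool
for step in range(200):
    batch = data[torch.randperm(data.size(0))[:64]]
    # Pick a selection strategy: explore with probability eps, else exploit the best so far.
    if torch.rand(1).item() < eps:
        s = torch.randint(len(strategies), (1,)).item()
    else:
        s = int(values.argmax())
    idx = strategies[s](student, batch, k=16)   # query the teacher on only 25% of the batch
    x = batch[idx]
    with torch.no_grad():
        t_logits = teacher(x)
    loss = kd_loss(student(x), t_logits)
    loss_before = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Reward = decrease in KD loss on the same subset after the update.
    with torch.no_grad():
        reward = loss_before - kd_loss(student(x), t_logits).item()
    counts[s] += 1
    values[s] += (reward - values[s]) / counts[s]
```

In this sketch the teacher is queried only on the selected indices, which is where the acceleration comes from; the per-strategy reward estimate lets the loop shift between strategies as training progresses.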
Pages: 56-63
Number of pages: 8