AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Cited by: 1
Authors
Zhou, Qinhong [1 ]
Li, Peng [2 ]
Liu, Yang [2 ]
Guan, Yuyang [3 ]
Xing, Qizhou [3 ]
Chen, Ming [3 ]
Sun, Maosong [1 ]
Liu, Yang [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Tsinghua Univ, Inst AI Ind Res, Beijing, Peoples R China
[3] Beijing Sinovoice Technol Co Ltd, Beijing, Peoples R China
Source
AI OPEN | 2023, Vol. 4
Keywords
Knowledge distillation; Pre-trained language model; Active learning;
DOI
10.1016/j.aiopen.2023.08.005
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing the KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically apply a single static data selection strategy throughout the KD process, yielding inconsistent improvements across distillation scenarios. In this work, we conduct a thorough study of typical data selection strategies for KD and show that this inconsistency arises because the best data selection strategy depends on several factors, including the task, the selected data size, and the training stage. To adapt to these factors automatically, we propose AdaDS, a framework that learns to choose the data selection strategy adaptively during the KD process. Experimental results show that our method is effective across tasks and selected data sizes in both the fine-tuning and pre-training stages, achieving performance comparable to DistilBERT with only 10% of the queries to the teacher model.
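The abstract describes computing the KD loss on a selected subset of the training data and adapting the choice of selection strategy during distillation. Below is a minimal illustrative sketch of that idea in PyTorch, assuming the adaptive choice is modeled as an epsilon-greedy selection among candidate strategies, rewarded by the drop in KD loss after each update; the strategy set, reward definition, toy models, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: KD on a selected subset with adaptive strategy choice.
# Not the AdaDS implementation; names and the bandit-style chooser are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KD loss: KL divergence between temperature-scaled distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def select_random(student, x, k):
    """Candidate strategy 1: uniform random subset."""
    return torch.randperm(x.size(0))[:k]

def select_uncertain(student, x, k):
    """Candidate strategy 2: examples where the student is most uncertain (highest entropy)."""
    with torch.no_grad():
        p = F.softmax(student(x), dim=-1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(-1)
    return entropy.topk(k).indices

strategies = [select_random, select_uncertain]
values = torch.zeros(len(strategies))   # running reward estimate per strategy
counts = torch.zeros(len(strategies))
eps = 0.1                               # epsilon-greedy exploration rate

# Toy teacher/student stand-ins for large and small PLMs.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

data = torch.randn(512, 16)             # toy unlabeled pool
for step in range(200):
    batch = data[torch.randperm(data.size(0))[:64]]
    # Pick a selection strategy: explore with probability eps, else exploit the best so far.
    if torch.rand(1).item() < eps:
        s = torch.randint(len(strategies), (1,)).item()
    else:
        s = int(values.argmax())
    idx = strategies[s](student, batch, k=16)   # query the teacher on only 25% of the batch
    x = batch[idx]
    with torch.no_grad():
        t_logits = teacher(x)
    loss = kd_loss(student(x), t_logits)
    loss_before = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Reward = decrease in KD loss on the same subset after the update.
    with torch.no_grad():
        reward = loss_before - kd_loss(student(x), t_logits).item()
    counts[s] += 1
    values[s] += (reward - values[s]) / counts[s]
```

In this sketch the teacher is queried only on the selected indices, which is where the acceleration comes from; the per-strategy reward estimate lets the loop shift between strategies as training progresses.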
Pages: 56-63
Number of pages: 8