Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning

Times Cited: 0
Authors
Xiang, Kun [1 ]
Fujii, Akihiro [1 ]
Affiliation
[1] Hosei Univ, Dept Sci & Engn, Tokyo, Japan
Source
2022 7TH INTERNATIONAL CONFERENCE ON BUSINESS AND INDUSTRIAL RESEARCH (ICBIR2022) | 2022
Keywords
Deep Neural Network; Knowledge Distillation; Model Compression; Multi-Task Learning; LSTM;
DOI
10.1109/ICBIR54589.2022.9786508
Chinese Library Classification
F [Economics];
Subject Classification Code
02 ;
Abstract
The success of pre-trained language representation models such as BERT stems from their "overparameterized" nature, which results in time-consuming training, high computational complexity, and demanding hardware requirements. Among the variety of model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring as much knowledge as possible from the teacher model to the student model without sacrificing accuracy while accelerating inference; and (ii) the higher training speed of the lightweight model comes with a risk of overfitting caused by noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To mitigate the inherent noise problem, we propose a coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependency information. Our approach achieves performance approximately on par with BERT-base and the teacher model, with 12x and 281x inference speedups and 19.58x and 8.94x fewer parameters, respectively.
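The sketch below illustrates the two building blocks named in the abstract: a hard parameter-sharing CNN-BiLSTM encoder for multi-task learning and a standard temperature-scaled distillation loss. This is a minimal sketch assuming PyTorch; all names, layer sizes, and hyperparameters (SharedCNNBiLSTM, kd_loss, T, alpha) are illustrative assumptions, not the authors' actual implementation.
```python
# Minimal, hypothetical sketch of (a) a shared CNN-BiLSTM encoder for MTL and
# (b) a Hinton-style distillation loss. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCNNBiLSTM(nn.Module):
    """Hard parameter-sharing encoder: a 1-D CNN captures regional (n-gram)
    features, a BiLSTM on top captures long-term dependencies; each task
    keeps its own classification head."""
    def __init__(self, vocab_size, emb_dim=128, conv_ch=128, lstm_hid=128,
                 num_classes=(2, 5)):  # example: two tasks
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_ch, lstm_hid, batch_first=True,
                              bidirectional=True)
        self.heads = nn.ModuleList([nn.Linear(2 * lstm_hid, c)
                                    for c in num_classes])

    def forward(self, token_ids, task_id):
        x = self.emb(token_ids)                    # (B, T, E)
        x = F.relu(self.conv(x.transpose(1, 2)))   # (B, C, T)
        x, _ = self.bilstm(x.transpose(1, 2))      # (B, T, 2H)
        x = x.mean(dim=1)                          # mean pooling over time
        return self.heads[task_id](x)              # task-specific logits

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Distillation objective: KL divergence against the teacher's
    temperature-softened distribution plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```
In this kind of setup the CNN student would be trained with kd_loss against logits produced by the frozen BERT-of-Theseus teacher, while the shared CNN-BiLSTM encoder is updated jointly across tasks; the exact weighting and task mixture used in the paper are not specified here.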
Pages: 72-77
Number of pages: 6
References (29 in total)
  • [1] Aguilar Gustavo, 2019, KNOWLEDGE DISTILLATI
  • [2] Caruana R. Multitask learning. Machine Learning, 1997, 28(1): 41-75.
  • [3] Cesa-Bianchi N., Conconi A., Gentile C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 2004, 50(9): 2050-2057.
  • [4] Clark K., 2019, 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), p. 5931.
  • [5] Devlin J., 2019, arXiv:1810.04805.
  • [6] Evgeniou Theodoros, 2004, Proceedings of the 10th ACM SIGKDD International Conference, p. 109.
  • [7] Han Song, Liu Xingyu, Mao Huizi, Pu Jing, Pedram Ardavan, Horowitz Mark A., Dally William J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016: 243-254.
  • [8] Hinton G., 2015, arXiv, v2.
  • [9] Jiao Xiaoqi, 2019, CoRR.
  • [10] Li J., 2014, P INT