Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning

Times Cited: 0
Authors
Xiang, Kun [1 ]
Fujii, Akihiro [1 ]
Affiliation
[1] Hosei Univ, Dept Sci & Engn, Tokyo, Japan
Source
2022 7TH INTERNATIONAL CONFERENCE ON BUSINESS AND INDUSTRIAL RESEARCH (ICBIR2022) | 2022
Keywords
Deep Neural Network; Knowledge Distillation; Model Compression; Multi-Task Learning; LSTM;
DOI
10.1109/ICBIR54589.2022.9786508
Chinese Library Classification
F [Economics];
Subject Classification Code
02 ;
Abstract
The success of pre-trained language representation models such as BERT stems from their "overparameterized" nature, which results in time-consuming training, high computational complexity, and demanding hardware requirements. Among the variety of model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring as much knowledge as possible from the teacher model to the student model without sacrificing accuracy while accelerating inference; and (ii) the higher training speed of the lightweight model comes with a risk of overfitting caused by noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To mitigate the inherent noise problem, we propose a coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependency information. Our approach achieves performance approximately on par with BERT-base and the teacher model, with 12x and 281x inference speedups and 19.58x and 8.94x fewer parameters, respectively.
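The sketch below illustrates the two building blocks named in the abstract: a hard parameter-sharing CNN-BiLSTM encoder for multi-task learning and a standard temperature-scaled distillation loss. This is a minimal sketch assuming PyTorch; all names, layer sizes, and hyperparameters (SharedCNNBiLSTM, kd_loss, T, alpha) are illustrative assumptions, not the authors' actual implementation.
```python
# Minimal, hypothetical sketch of (a) a shared CNN-BiLSTM encoder for MTL and
# (b) a Hinton-style distillation loss. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCNNBiLSTM(nn.Module):
    """Hard parameter-sharing encoder: a 1-D CNN captures regional (n-gram)
    features, a BiLSTM on top captures long-term dependencies; each task
    keeps its own classification head."""
    def __init__(self, vocab_size, emb_dim=128, conv_ch=128, lstm_hid=128,
                 num_classes=(2, 5)):  # example: two tasks
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_ch, lstm_hid, batch_first=True,
                              bidirectional=True)
        self.heads = nn.ModuleList([nn.Linear(2 * lstm_hid, c)
                                    for c in num_classes])

    def forward(self, token_ids, task_id):
        x = self.emb(token_ids)                    # (B, T, E)
        x = F.relu(self.conv(x.transpose(1, 2)))   # (B, C, T)
        x, _ = self.bilstm(x.transpose(1, 2))      # (B, T, 2H)
        x = x.mean(dim=1)                          # mean pooling over time
        return self.heads[task_id](x)              # task-specific logits

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Distillation objective: KL divergence against the teacher's
    temperature-softened distribution plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```
In this kind of setup the CNN student would be trained with kd_loss against logits produced by the frozen BERT-of-Theseus teacher, while the shared CNN-BiLSTM encoder is updated jointly across tasks; the exact weighting and task mixture used in the paper are not specified here.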
Pages: 72-77
Number of pages: 6
References (29 in total)
  • [1] Aguilar Gustavo, 2019, KNOWLEDGE DISTILLATI
  • [2] Caruana R. Multitask learning. Machine Learning, 1997, 28(1): 41-75.
  • [3] Cesa-Bianchi N., Conconi A., Gentile C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 2004, 50(9): 2050-2057.
  • [4] Clark K., 2019, 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), p. 5931.
  • [5] Devlin J., 2019, arXiv:1810.04805.
  • [6] Evgeniou Theodoros, 2004, Proceedings of the 10th ACM SIGKDD International Conference, p. 109.
  • [7] Han Song, Liu Xingyu, Mao Huizi, Pu Jing, Pedram Ardavan, Horowitz Mark A., Dally William J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016: 243-254.
  • [8] Hinton G., 2015, arXiv, v2.
  • [9] Jiao Xiaoqi, 2019, CoRR.
  • [10] Li J., 2014, P INT