Compressed-Transformer: Distilling Knowledge from Transformer for Neural Machine Translation

Cited by: 1
Authors
Chen, Yuan [1 ]
Rong, Pan [1 ]
Affiliations
[1] Sun Yat-sen University, School of Data and Computer Science, Guangzhou, People's Republic of China
Source
2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020 | 2020
Keywords
Compressed-Transformer; neural machine translation; knowledge distillation; factorize parameters; stage-wise distillation strategy;
DOI
10.1145/3443279.3443302
CLC classification
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Recently, Transformer has achieved state-of-the-art performance in neural machine translation. However, the number of parameters in Transformer is so large that the model must be compressed before it can be deployed and executed on resource-restricted devices. In this paper, we propose a compressed version of Transformer called Compressed-Transformer. We introduce two techniques, parameter factorization and block reduction, to compress the Transformer model. Consequently, the number of parameters can be reduced by more than 50%. We exploit a stage-wise distillation strategy with a dynamically adjusted temperature in knowledge distillation to transfer knowledge from the base Transformer (teacher) to Compressed-Transformer (student). A Chinese-to-English (Zh -> En) dataset from the United Nations Parallel Corpus and a German-to-English (De -> En) dataset from Multi30K are used, and the experimental results show that our compressed model achieves a BLEU score only slightly lower than that of the uncompressed teacher model. Specifically, when the number of parameters is reduced by 59.3%, the student model achieves a BLEU score of 40.69, only 1.64 lower than that of the teacher model, and inference speed is improved by 17% on the Zh -> En dataset. Experiments on the De -> En dataset show similar results.
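The abstract names parameter factorization, block reduction, and a temperature-controlled distillation loss, but the record gives no formulas. The Python sketch below is only a generic illustration of two of these ideas, a temperature-scaled knowledge-distillation loss and a low-rank factorized linear layer; it is not the authors' implementation, and all names here (distillation_loss, FactorizedLinear, rank, alpha) are hypothetical.

# Minimal sketch (not the paper's code): temperature-scaled knowledge
# distillation and low-rank weight factorization, the two generic ideas
# named in the abstract. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    # Soft-target KL term between teacher and student distributions,
    # softened by the temperature and rescaled by T^2 as in standard
    # knowledge distillation, blended with the hard-label cross entropy.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

class FactorizedLinear(nn.Module):
    # Replace a d_in x d_out weight matrix with two low-rank factors,
    # cutting the parameter count from d_in*d_out to rank*(d_in + d_out).
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=True)

    def forward(self, x):
        return self.up(self.down(x))

For example, with d_in = d_out = 512 and rank = 128, the factorized layer holds 512*128 + 128*512 = 131,072 weights instead of 512*512 = 262,144, a 50% cut; this is only an illustrative approximation of the kind of parameter saving the abstract reports, not the paper's actual configuration.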
Pages: 131-137
Number of pages: 7