Dual Knowledge Distillation for neural machine translation

Cited by: 3
Authors
Wan, Yuxian [1 ]
Zhang, Wenlin [1 ]
Li, Zhen [1 ]
Zhang, Hao [1 ]
Li, Yanxia [2 ]
Affiliations
[1] Univ Informat Engn, Sch Informat Syst Engn, Zhengzhou 450000, Peoples R China
[2] Univ Informat Engn, Basic Dept, Zhengzhou 450000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Knowledge distillation; k Nearest Neighbor Knowledge Distillation; Low-resource; Monolingual data;
DOI
10.1016/j.csl.2023.101583
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing knowledge distillation methods use a large amount of bilingual data and focus on mining the knowledge distribution shared between the source language and the target language. However, for some languages, bilingual data is not abundant. In this paper, to make better use of both monolingual and limited bilingual data, we propose a new knowledge distillation method called Dual Knowledge Distillation (DKD). For monolingual data, we use a self-distillation strategy that combines self-training and knowledge distillation so that the encoder extracts more consistent monolingual representations. For bilingual data, on top of the k Nearest Neighbor Knowledge Distillation (kNN-KD) method, a similar self-distillation strategy is adopted as a consistency regularization that forces the decoder to produce consistent output. Experiments on standard datasets, multi-domain translation datasets, and low-resource datasets show that DKD achieves consistent improvements over state-of-the-art baselines, including kNN-KD.
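To make the abstract's kNN-KD component concrete, the sketch below illustrates the general kNN-style recipe: a retrieval distribution is interpolated with the model's own distribution to form a teacher, and the student is trained toward that teacher with a standard KL distillation loss. This is a minimal illustration of the generic technique, not the paper's implementation; the mixing weight `lam` and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def knn_interpolated_teacher(model_probs, knn_probs, lam=0.4):
    # Mix the NMT model's distribution with a kNN retrieval distribution,
    # in the spirit of kNN-MT / kNN-KD teachers (lam is a hypothetical weight).
    return lam * knn_probs + (1.0 - lam) * model_probs

def kd_kl_loss(teacher_probs, student_probs, eps=1e-12):
    # KL(teacher || student): the usual knowledge-distillation objective.
    return float(np.sum(teacher_probs *
                        (np.log(teacher_probs + eps) - np.log(student_probs + eps))))

# Toy 3-word vocabulary example.
student = softmax(np.array([2.0, 1.0, 0.1]))   # student's next-token distribution
knn = np.array([0.7, 0.2, 0.1])                # distribution from retrieved neighbors
teacher = knn_interpolated_teacher(student, knn)
loss = kd_kl_loss(teacher, student)
```

The student is pushed toward the interpolated teacher, so tokens favored by the retrieved neighbors receive extra probability mass even when the parallel data is limited.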
Pages: 13