Combining Curriculum Learning and Knowledge Distillation for Dialogue Generation

Cited by: 0
Authors
Zhu, Qingqing [1 ]
Chen, Xiuying [2 ]
Wu, Pengfei [1 ]
Liu, JunFei [1 ]
Zhao, Dongyan [3 ]
Affiliations
[1] Peking Univ, Sch Software & Microelect, Beijing, Peoples R China
[2] Peking Univ, Ctr Data Sci, AAIS, Beijing, Peoples R China
[3] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021 | 2021
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Curriculum learning, a training strategy that feeds instances to the model from easy to hard, has been shown to facilitate the dialogue generation task. Meanwhile, knowledge distillation, a knowledge transfer methodology between teacher and student networks, can yield a significant performance boost for student models. Hence, in this paper, we introduce a combination of curriculum learning and knowledge distillation for efficient dialogue generation models, where curriculum learning helps knowledge distillation from both the data and the model aspect. From the data aspect, we cluster the training cases according to their complexity, which is computed from various features such as sentence length and the coherence between dialogue pairs. From the model aspect, we employ an adversarial training strategy to identify the complexity of cases: the intuition is that if a discriminator can tell whether a generated response comes from the teacher or the student, the case is a difficult one that the student model has not yet adapted to. Finally, we use self-paced learning, an extension of curriculum learning, to assign weights for distillation. Based on these two aspects, we arrange a hierarchical curriculum for the student model under the guidance of the teacher model. Experimental results demonstrate that our method achieves improvements over competitive baselines.
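
As a rough illustration of the approach described in the abstract (not the authors' implementation), the Python sketch below combines a hand-crafted data-level difficulty score, built from sentence length and a precomputed coherence score, with a binary self-paced weighting of the token-level distillation loss between teacher and student. All function names, feature weights, and thresholds are illustrative assumptions.

import torch
import torch.nn.functional as F


def data_difficulty(context: str, response: str, coherence: float) -> float:
    """Data-level difficulty score in [0, 1] for one dialogue pair.

    `coherence` is assumed to be an externally computed context-response
    coherence score in [0, 1]; longer and less coherent pairs count as
    harder. The 0.5/0.5 mixing weights are illustrative.
    """
    n_tokens = len(context.split()) + len(response.split())
    length_term = min(n_tokens / 80.0, 1.0)              # longer = harder
    return 0.5 * length_term + 0.5 * (1.0 - coherence)   # less coherent = harder


def self_paced_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 lam: float,
                                 temperature: float = 2.0) -> torch.Tensor:
    """Distillation loss with classic binary self-paced weights.

    Computes a per-example, token-averaged KL divergence between the teacher
    and student output distributions; examples whose current loss exceeds the
    self-paced threshold `lam` are dropped (weight 0), easier examples are
    kept (weight 1). Logits are expected with shape (batch, seq_len, vocab).
    """
    t = temperature
    kl = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="none").sum(-1).mean(-1) * (t * t)
    weights = (kl.detach() < lam).float()  # 1 = easy enough to learn from now
    return (weights * kl).sum() / weights.sum().clamp(min=1.0)


if __name__ == "__main__":
    student = torch.randn(4, 10, 100)  # (batch, seq_len, vocab)
    teacher = torch.randn(4, 10, 100)
    print(self_paced_distillation_loss(student, teacher, lam=5.0))

In a full training loop the self-paced threshold lam would typically be increased over time, so that harder cases are gradually admitted into the distillation objective.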
Pages: 1284-1295
Page count: 12