Reward estimation with scheduled knowledge distillation for dialogue policy learning

Times cited: 2
Authors
Qiu, Junyan [1 ]
Zhang, Haidong [2 ]
Yang, Yiping [2 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Keywords
Reinforcement learning; dialogue policy learning; curriculum learning; knowledge distillation
DOI
10.1080/09540091.2023.2174078
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from challenges such as sparse and delayed rewards. Moreover, because the user goal is unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we propose to leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning and improves the task-completion success rate by 0.47% to 9.01% over several strong baselines.
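The abstract describes the method only at a high level. As an illustration only, the following is a minimal PyTorch sketch of the two ingredients it names: soft-target knowledge distillation from a goal-aware teacher reward estimator into a compact goal-free student, combined with a self-paced (hard-weighting) sample schedule. Every name here (the models, the batch keys state and state_with_goal, the temperature, and the threshold rule) is an assumption for illustration, not the authors' published implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target KD loss (Hinton-style); the temperature value is an assumption.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-sample KL divergence, rescaled by T^2 as is conventional for KD.
    return F.kl_div(log_student, soft_targets, reduction="none").sum(-1) * temperature ** 2

def self_paced_weights(per_sample_loss, threshold):
    # Binary self-paced regime: a sample participates only while its
    # current loss is below the threshold, so "easy" samples come first.
    return (per_sample_loss.detach() < threshold).float()

def train_step(student, teacher, batch, optimizer, threshold):
    with torch.no_grad():
        # Hypothetical: the teacher's input includes the user goal,
        # which is available at training time but not at deployment.
        teacher_logits = teacher(batch["state_with_goal"])
    # The student must estimate rewards without access to the user goal.
    student_logits = student(batch["state"])
    loss_vec = distillation_loss(student_logits, teacher_logits)
    weights = self_paced_weights(loss_vec, threshold)
    loss = (weights * loss_vec).sum() / weights.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In a full loop one would presumably raise the threshold as training proceeds so that harder dialogue turns enter gradually; that growing schedule appears to be what "scheduled" refers to, though the paper's exact pacing function is not given in this record.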
Pages: 28