Grabbing the Long Tail: A data normalization method for diverse and informative dialogue generation

Times Cited: 11
Authors
Zhan, Zhiqiang [1 ]
Zhao, Jianyu [2 ]
Zhang, Yang [1 ]
Gong, Jiangtao [1 ]
Wang, Qianying [1 ]
Shen, Qi [3 ]
Zhang, Liuxin [1 ]
Affiliations
[1] Lenovo Res, Smart Educ Lab, Beijing, Peoples R China
[2] Lenovo Res, AI Lab, Beijing, Peoples R China
[3] Beijing Union Univ, Beijing, Peoples R China
Keywords
Dialogue generation; Long Tail; Data normalization; Diversity; Informativeness;
DOI
10.1016/j.neucom.2021.07.039
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent neural models have made significant progress in dialogue generation. Most of these models are built on language models and generate responses word by word, conditioned on the preceding context. Because of this generation mechanism, and because the widely used cross-entropy loss continually pushes the distribution of generated text toward that of the training data, trained generation models inevitably favor the most frequent words in the training corpus, which leads to low diversity and poor informativeness. By investigating several mainstream dialogue generation models, we find that the probable cause is the intrinsic Long Tail phenomenon of natural language. To address these issues, we analyze a large corpus from Wikipedia and propose an efficient frequency-based data normalization method, Log Normalization. We further explore two additional methods, Mutual Normalization and Log-Mutual Normalization, to eliminate the mutual-information effect. To validate the effectiveness of the proposed methods, we conduct extensive experiments on three datasets covering different domains: social media, film subtitles, and online customer service. Compared with vanilla Transformers, generation models augmented with our methods produce significantly better responses in terms of both diversity and informativeness. Specifically, unigram and bigram diversity of the responses improves by 8.5%-14.1% and 19.7%-25.8% on the three datasets, respectively, and informativeness (defined as the number of nouns and verbs) increases by 13.1%-31.0% and 30.4%-59.0%, respectively. Moreover, because the methods are model-agnostic, they can be adapted to new generation models efficiently and effectively. (c) 2021 Elsevier B.V. All rights reserved.
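The record does not give the paper's exact formulas, so the following minimal Python sketch only illustrates the two kinds of quantities the abstract refers to: an assumed inverse-log-frequency weighting in the spirit of a frequency-based "Log Normalization" (up-weighting long-tail words relative to frequent ones), and the standard distinct-n metric behind the reported unigram/bigram diversity. The function names and the weighting formula are illustrative assumptions, not the authors' definitions.

```python
# Hedged sketch: a plausible frequency-based log-normalization weighting and the
# distinct-n diversity metric. Not the paper's exact formulation.
from collections import Counter
from math import log


def log_normalized_weights(corpus_tokens, eps=1.0):
    """Assign each word a weight inversely related to the log of its corpus
    frequency, so rare (long-tail) words are up-weighted relative to frequent
    ones. Assumed instantiation for illustration only."""
    counts = Counter(corpus_tokens)
    return {w: 1.0 / log(c + 1.0 + eps) for w, c in counts.items()}


def distinct_n(responses, n=1):
    """Distinct-n: ratio of unique n-grams to total n-grams across all
    generated responses (standard diversity metric)."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    corpus = "the the the cat sat on the mat".split()
    print(log_normalized_weights(corpus))            # rare words get larger weights
    gens = [["i", "do", "not", "know"], ["i", "like", "the", "mat"]]
    print(distinct_n(gens, 1), distinct_n(gens, 2))  # unigram / bigram diversity
```

Such per-word weights could, for example, be used to rescale the cross-entropy loss or to normalize training data; which of these the paper actually does is not stated in this record.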
Pages: 374-384
Number of pages: 11