Automatic Corpus Extension for Data-driven Natural Language Generation

Cited by: 0
Authors
Manishina, Elena [1]
Jabaian, Bassam [1]
Huet, Stephane [1]
Lefevre, Fabrice [1]
Affiliation
[1] Univ Avignon, LIA CERI, Avignon, France
Source
LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016
Keywords
corpus building; natural language generation; automatic paraphrasing;
DOI
Not available
Chinese Library Classification number
H [Language, Linguistics];
Discipline classification code
05;
Abstract
As data-driven approaches started to make their way into the Natural Language Generation (NLG) domain, the need for automation of corpus building and extension became apparent. Corpus creation and extension in the data-driven NLG domain have traditionally involved manual paraphrasing, performed either by a group of experts or through crowd-sourcing. Building training corpora manually is a costly enterprise that requires a lot of time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
Pages: 3624-3631
Number of pages: 8