Creating Training Corpora for NLG Micro-Planning

Cited by: 277
Authors
Gardent, Claire [1 ]
Shimorina, Anastasia [1 ]
Narayan, Shashi [2 ]
Perez-Beltrachini, Laura [2 ]
Affiliations
[1] CNRS, UMR 7503, LORIA, F-54500 Vandoeuvre Les Nancy, France
[2] Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland
Source
PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1 | 2017年
Funding
European Union Horizon 2020;
Keywords
DOI
10.18653/v1/P17-1017
Chinese Library Classification number
TP39 [计算机的应用];
Discipline classification code
081203 ; 0835 ;
Abstract
In this paper, we present a novel framework for semi-automatically creating linguistically challenging microplanning data-to-text corpora from existing Knowledge Bases. Because our method pairs data of varying size and shape with texts ranging from simple clauses to short texts, a dataset created using this framework provides a challenging benchmark for microplanning. Another feature of this framework is that it can be applied to any large scale knowledge base and can therefore be used to train and learn KB verbalisers. We apply our framework to DBpedia data and compare the resulting dataset with Wen et al. (2016)'s. We show that while Wen et al.'s dataset is more than twice larger than ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned which are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation. To encourage researchers to take up this challenge, we recently made available a dataset created using this framework in the context of the WEBNLG shared task.
Pages: 179-188
Page count: 10
Related papers
18 in total
[1]  
[Anonymous], P ICSLP
[2]  
[Anonymous], P ACL SYST DEM
[3]  
[Anonymous], 2015, P EMNLP
[4]  
[Anonymous], 2002, P 40 ANN M ASS COMP
[5]  
Banarescu Laura, 2012, P EMNLP
[6]  
Banik Eva, 2013, P ENLG
[7]  
Belz Anja, 2011, P ENLG
[8]  
Chen David L, 2008, P ICML
[9]  
Lampouras Gerasimos, 2016, P COLING
[10]  
Lebret David, 2016, P EMNLP