Avoiding Overlap in Data Augmentation for AMR-to-Text Generation

Cited: 0
Authors
Du, Wenchao [1]
Flanigan, Jeffrey [1]
Affiliation
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Source
ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2 | 2021
Keywords
DOI
None available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if there is overlap between the additional data and the target text evaluation data, then training on the additional data is training on the answers to the test set. This leads to overly inflated scores with the additional data compared to real-world testing scenarios, and to problems when comparing models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.
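The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' exact method; the `normalize` and `filter_overlap` functions below are hypothetical names, and the sketch assumes the simplest case of excluding additional-data sentences whose normalized text exactly matches an evaluation-set sentence:

```python
def normalize(sentence: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same sentence still match.
    return " ".join(sentence.lower().split())

def filter_overlap(additional_corpus, eval_sentences):
    # Hypothetical sketch: drop any additional-data sentence whose
    # normalized form appears among the evaluation references,
    # so training never sees test-set answers.
    eval_set = {normalize(s) for s in eval_sentences}
    return [s for s in additional_corpus if normalize(s) not in eval_set]

gigaword_like = ["The cat sat.", "Stocks rose sharply today."]
amr_eval = ["Stocks  rose sharply today."]
print(filter_overlap(gigaword_like, amr_eval))  # → ['The cat sat.']
```

A real pipeline would likely also need fuzzier matching (e.g. near-duplicate detection), since evaluation sentences may appear in the additional data with small edits.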
Pages: 1043 / 1048
Page count: 6