A Benchmark Dataset for Multi-Level Complexity-Controllable Machine Translation

Cited by: 0
Authors
Tani, Kazuki [1 ]
Yuasa, Ryoya [1 ]
Takikawa, Kazuki [2 ]
Tamura, Akihiro [1 ]
Kajiwara, Tomoyuki [2 ]
Ninomiya, Takashi [2 ]
Kato, Tsuneo [1 ]
Affiliations
[1] Doshisha Univ, Kyoto, Japan
[2] Ehime Univ, Matsuyama, Japan
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Machine Translation; Natural Language Generation; Corpus (Creation, Annotation, etc.)
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Code
081203; 0835
Abstract
This paper introduces a new benchmark test dataset for multi-level complexity-controllable machine translation (MLCC-MT), i.e., MT that controls the complexity of its output at more than two levels. In previous studies, MLCC-MT models have been evaluated on a test dataset automatically generated from the Newsela corpus, a document-level comparable corpus with document-level complexity labels. The existing test dataset has three issues: first, a source-language sentence and its target-language sentence are not necessarily an exact translation pair, because the pairs are detected automatically. Second, a target-language sentence and its simplified counterpart are not always perfectly parallel, since they are aligned automatically. Third, the sentence-level complexity is not always appropriate, because it is derived from the article-level complexity assigned in the Newsela corpus. We therefore created a benchmark test dataset for Japanese-to-English MLCC-MT from the Newsela corpus by introducing automatic filtering of data with inappropriate sentence-level complexity, a manual check of parallel target-language sentences with different complexity levels, and manual translation. Furthermore, we implement two MLCC-NMT frameworks based on a Transformer architecture and report their performance on our test dataset as baselines for future research. Our test dataset and code are publicly released.
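The abstract mentions two Transformer-based MLCC-NMT baselines but does not specify their designs. As a purely illustrative sketch, the snippet below shows one common way to make output complexity controllable: prepending a complexity-level pseudo-token to each source sentence before training a standard NMT model. The tag format, the number of levels, and the helper names are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of tag-based complexity control for NMT.
# Assumptions (not from the paper): complexity levels are integers 0-4,
# and control is achieved by prepending a pseudo-token such as "<lvl_2>"
# to the source sentence, in the style of side-constraint approaches.

from typing import List, Tuple

LEVELS = range(5)  # hypothetical number of complexity levels


def add_complexity_tag(src: str, level: int) -> str:
    """Prepend a complexity pseudo-token to a source sentence."""
    if level not in LEVELS:
        raise ValueError(f"unknown complexity level: {level}")
    return f"<lvl_{level}> {src}"


def build_training_pairs(
    triples: List[Tuple[str, str, int]]
) -> List[Tuple[str, str]]:
    """Turn (source, target, target_complexity) triples into tagged
    (source, target) pairs for a standard Transformer NMT toolkit."""
    return [(add_complexity_tag(src, lvl), tgt) for src, tgt, lvl in triples]


if __name__ == "__main__":
    data = [
        ("国境の長いトンネルを抜けると雪国であった。",
         "The train came out of the long tunnel into the snow country.",
         3),
    ]
    for tagged_src, tgt in build_training_pairs(data):
        print(tagged_src, "=>", tgt)
```

In such a setup, the desired output complexity at inference time is selected simply by choosing which pseudo-token to prepend to the source sentence.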
Pages: 6744-6752
Number of pages: 9