A Benchmark Dataset for Multi-Level Complexity-Controllable Machine Translation

Cited by: 0
Authors
Tani, Kazuki [1 ]
Yuasa, Ryoya [1 ]
Takikawa, Kazuki [2 ]
Tamura, Akihiro [1 ]
Kajiwara, Tomoyuki [2 ]
Ninomiya, Takashi [2 ]
Kato, Tsuneo [1 ]
Affiliations
[1] Doshisha Univ, Kyoto, Japan
[2] Ehime Univ, Matsuyama, Japan
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Machine Translation; Natural Language Generation; Corpus (Creation, Annotation, etc.)
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Code
081203; 0835
Abstract
This paper introduces a new benchmark test dataset for multi-level complexity-controllable machine translation (MLCC-MT), i.e., MT that controls the complexity of its output at more than two levels. In previous studies, MLCC-MT models have been evaluated on a test dataset automatically generated from the Newsela corpus, a document-level comparable corpus with document-level complexity labels. The existing test dataset has three issues: first, a source-language sentence and its target-language sentence are not necessarily an exact translation pair, because the pairs are detected automatically. Second, a target-language sentence and its simplified counterpart are not always perfectly parallel, since they are aligned automatically. Third, the sentence-level complexity is not always appropriate, because it is derived from the article-level complexity assigned in the Newsela corpus. We therefore created a benchmark test dataset for Japanese-to-English MLCC-MT from the Newsela corpus by introducing automatic filtering of data with inappropriate sentence-level complexity, a manual check of parallel target-language sentences with different complexity levels, and manual translation. Furthermore, we implement two MLCC-NMT frameworks based on a Transformer architecture and report their performance on our test dataset as baselines for future research. Our test dataset and code are publicly released.
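The abstract mentions two Transformer-based MLCC-NMT baselines but does not specify their designs. As a purely illustrative sketch, the snippet below shows one common way to make output complexity controllable: prepending a complexity-level pseudo-token to each source sentence before training a standard NMT model. The tag format, the number of levels, and the helper names are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of tag-based complexity control for NMT.
# Assumptions (not from the paper): complexity levels are integers 0-4,
# and control is achieved by prepending a pseudo-token such as "<lvl_2>"
# to the source sentence, in the style of side-constraint approaches.

from typing import List, Tuple

LEVELS = range(5)  # hypothetical number of complexity levels


def add_complexity_tag(src: str, level: int) -> str:
    """Prepend a complexity pseudo-token to a source sentence."""
    if level not in LEVELS:
        raise ValueError(f"unknown complexity level: {level}")
    return f"<lvl_{level}> {src}"


def build_training_pairs(
    triples: List[Tuple[str, str, int]]
) -> List[Tuple[str, str]]:
    """Turn (source, target, target_complexity) triples into tagged
    (source, target) pairs for a standard Transformer NMT toolkit."""
    return [(add_complexity_tag(src, lvl), tgt) for src, tgt, lvl in triples]


if __name__ == "__main__":
    data = [
        ("国境の長いトンネルを抜けると雪国であった。",
         "The train came out of the long tunnel into the snow country.",
         3),
    ]
    for tagged_src, tgt in build_training_pairs(data):
        print(tagged_src, "=>", tgt)
```

In such a setup, the desired output complexity at inference time is selected simply by choosing which pseudo-token to prepend to the source sentence.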
Pages: 6744-6752
Number of pages: 9