DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation

Cited by: 4

Authors:

Bawden, Rachel ^{[1]}

Bilinski, Eric ^{[2]}

Lavergne, Thomas ^{[3]}

Rosset, Sophie ^{[2]}

Affiliations:

[1] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland

[2] Univ Paris Saclay, LIMSI, CNRS, Orsay, France

[3] Univ Paris Sud, LIMSI, CNRS, Univ Paris Saclay, Orsay, France

Source:

LANGUAGE RESOURCES AND EVALUATION | 2021 / Volume 55 / Issue 03

Keywords:

Machine translation; Corpus; Dataset; Evaluation; Bilingual; Dialogue; Chat

DOI:

10.1007/s10579-020-09514-4

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

We present a new English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used.

Pages: 635-660

Page count: 26