Quality Estimation for Synthetic Parallel Data Generation

Cited by: 0
Authors
Rubino, Raphael [1]
Toral, Antonio [2]
Ljubesic, Nikola [3]
Ramirez-Sanchez, Gema [1]
Affiliations
[1] Prompsit Language Engn, Elche, Spain
[2] Dublin City Univ, CNGL Sch Comp, Dublin 9, Ireland
[3] Univ Zagreb, Dept Informat & Commun Sci, Zagreb 41000, Croatia
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
Under-resourced Languages; Synthetic Corpora; Machine Translation; Quality Estimation;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303; 0501; 050102;
Abstract
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English-Croatian version of the Europarl parallel corpus based on the English-Slovene Europarl corpus and the Apertium rule-based translation system for Slovene-Croatian. These experiments are to be considered as a first step towards the generation of reliable synthetic parallel data for under-resourced languages. We first collect small amounts of aligned parallel data for the Slovene-Croatian language pair in order to build a quality estimation system for sentence-level Translation Edit Rate (TER) estimation. We then infer TER scores on automatically translated Slovene to Croatian sentences and use the best translations to build an English-Croatian statistical MT system. We show significant improvement in terms of automatic metrics obtained on two test sets using our approach compared to a random selection of synthetic parallel data.
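The selection step described in the abstract (keeping the machine-translated sentences with the best estimated quality) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SyntheticPair` type, the `select_best` function, the sentences, and the scores are all hypothetical placeholders. The only assumption taken from the abstract is that each synthetic pair carries a predicted sentence-level TER; since TER measures the amount of editing a translation needs, a lower score is treated as better.

```python
from typing import List, NamedTuple


class SyntheticPair(NamedTuple):
    """One synthetic English-Croatian sentence pair (hypothetical structure)."""
    english: str          # English side, taken from the English-Slovene corpus
    croatian: str         # Slovene side machine-translated into Croatian
    predicted_ter: float  # estimated sentence-level TER; lower means fewer edits expected


def select_best(pairs: List[SyntheticPair], keep: int) -> List[SyntheticPair]:
    """Return the `keep` pairs with the lowest predicted TER."""
    return sorted(pairs, key=lambda p: p.predicted_ter)[:keep]


# Placeholder data: the sentences and scores are made up for illustration only.
pairs = [
    SyntheticPair("en sentence A", "hr sentence A", 0.42),
    SyntheticPair("en sentence B", "hr sentence B", 0.10),
    SyntheticPair("en sentence C", "hr sentence C", 0.75),
    SyntheticPair("en sentence D", "hr sentence D", 0.23),
]

selected = select_best(pairs, keep=2)
for pair in selected:
    print(pair.english, pair.predicted_ter)
```

The baseline the abstract compares against, a random selection of the same size, could be drawn with `random.sample(pairs, keep)` from the Python standard library.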
Pages: 1843-1849
Number of pages: 7