Quality Estimation for Synthetic Parallel Data Generation

Cited by: 0
Authors
Rubino, Raphael [1]
Toral, Antonio [2]
Ljubesic, Nikola [3]
Ramirez-Sanchez, Gema [1]
Affiliations
[1] Prompsit Language Engn, Elche, Spain
[2] Dublin City Univ, CNGL Sch Comp, Dublin 9, Ireland
[3] Univ Zagreb, Dept Informat & Commun Sci, Zagreb 41000, Croatia
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
Under-resourced Languages; Synthetic Corpora; Machine Translation; Quality Estimation;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification codes
030303 ; 0501 ; 050102 ;
Abstract
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English-Croatian version of the Europarl parallel corpus based on the English-Slovene Europarl corpus and the Apertium rule-based translation system for Slovene-Croatian. These experiments are to be considered as a first step towards the generation of reliable synthetic parallel data for under-resourced languages. We first collect small amounts of aligned parallel data for the Slovene-Croatian language pair in order to build a quality estimation system for sentence-level Translation Edit Rate (TER) estimation. We then infer TER scores on automatically translated Slovene to Croatian sentences and use the best translations to build an English-Croatian statistical MT system. We show significant improvement in terms of automatic metrics obtained on two test sets using our approach compared to a random selection of synthetic parallel data.
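The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `simple_ter` and `select_best` are hypothetical, the TER variant shown ignores the block shifts that real TER counts, and the predicted scores are assumed to come from some separately trained quality estimation model.

```python
from typing import List, Sequence, Tuple


def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the reference length.

    Simplification (assumption): real TER also treats a block shift as a
    single edit; this sketch counts only insertions, deletions and
    substitutions.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def select_best(
    pairs: Sequence[Tuple[str, str]],
    predicted_ter: Sequence[float],
    keep: int,
) -> List[Tuple[str, str]]:
    """Keep the `keep` sentence pairs with the lowest predicted TER.

    `pairs` holds (source, machine-translated target) sentences and
    `predicted_ter` holds one estimated score per pair (lower is better).
    """
    if len(pairs) != len(predicted_ter):
        raise ValueError("pairs and predicted_ter must have the same length")
    # Sort indices by score; ties are broken by original position.
    ranked = sorted(range(len(pairs)), key=lambda i: (predicted_ter[i], i))
    return [pairs[i] for i in ranked[:keep]]
```

In this sketch, `simple_ter` would supply training labels from the small aligned Slovene-Croatian set, while `select_best` filters the synthetic English-Croatian pairs before MT training. A random-selection baseline of the kind the abstract compares against would replace the ranking with a random sample of the same size.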
Pages: 1843-1849
Number of pages: 7