A Wikipedia-based Corpus for Contextualized Machine Translation

Cited by: 0
Authors
Drexler, Jennifer [1]
Rastogi, Pushpendre [2]
Aguilar, Jacqueline [3]
Van Durme, Benjamin [3]
Post, Matt [3]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
Machine Translation; Domain Adaptation; Corpus;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303 ; 0501 ; 050102 ;
Abstract
We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. This corpus comprises (i) a set of curated English Wikipedia articles describing news events along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality when including language models built over these English documents and interpolated with other, separately-derived, more general language model sources. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.
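The abstract does not say which interpolation scheme was used. As a minimal sketch, assuming simple linear interpolation of two add-alpha smoothed unigram models, the idea of mixing a context-specific language model with a general one might look like the following. The toy sentences, the function names, and the weight `lam` are all hypothetical and are not taken from the paper or its corpus.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (illustrative only)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_context, p_general, lam):
    """Linear interpolation: lam * p_context(w) + (1 - lam) * p_general(w)."""
    return {w: lam * p_context[w] + (1.0 - lam) * p_general[w] for w in p_context}


def perplexity(model, tokens):
    """Per-token perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Toy data, invented for this sketch (not drawn from the described corpus).
general = "the cat sat on the mat the dog ran".split()
context = "the earthquake struck the coast the earthquake".split()
held_out = "the earthquake struck".split()

vocab = set(general) | set(context) | set(held_out)
p_gen = unigram_lm(general, vocab)   # stands in for a general-purpose LM
p_ctx = unigram_lm(context, vocab)   # stands in for an LM built on comparable documents
p_mix = interpolate(p_ctx, p_gen, lam=0.5)  # lam is an assumed, untuned weight

print("general :", perplexity(p_gen, held_out))
print("mixed   :", perplexity(p_mix, held_out))
```

In a real system the weight would normally be tuned on held-out data, and the models would be higher-order n-gram models rather than unigrams.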
Pages: 3593-3596
Page count: 4
Related papers
50 in total
  • [31] Content-Equivalent Translated Parallel News Corpus and Extension of Domain Adaptation for Neural Machine Translation
    Mino, Hideya
    Tanaka, Hideki
    Ito, Hitoshi
    Goto, Isao
    Yamada, Ichiro
    Tokunaga, Takenobu
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3616 - 3622
  • [32] Creating domain-specific translation memories for machine translation finetuning: the TRENCARD bilingual cardiology corpus
    Dogru, Gokhan
    [J]. TRADUMATICA-TRADUCCIO I TECNOLOGIES DE LA INFORMACIO I LA COMUNICACIO, 2024, (22): 1 - 30
  • [33] Machine Translation Based on Domain Adaptive Language Model
    Li, Lingling
    Chen, Xianlong
    Xu, Yiling
    [J]. 2020 16TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS 2020), 2020, : 116 - 120
  • [34] The extraction method used for English-Chinese machine translation corpus based on bilingual sentence pair coverage
    Dang, Penghua
    [J]. OPEN COMPUTER SCIENCE, 2024, 14 (01):
  • [35] Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia
    Fedorov P.E.
    Mironov A.V.
    Chernishev G.A.
    [J]. Lobachevskii Journal of Mathematics, 2023, 44 (1) : 111 - 122
  • [36] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Huu-anh Tran
    Yuhang Guo
    Ping Jian
    Shumin Shi
    Heyan Huang
    [J]. Journal of Beijing Institute of Technology, 2018, 27 (01) : 127 - 136
  • [37] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Tran H.-A.
    Guo Y.
    Jian P.
    Shi S.
    Huang H.
    [J]. Journal of Beijing Institute of Technology (English Edition), 2018, 27 (01) : 127 - 136
  • [38] The FAUST Corpus of Adequacy Assessments for Real-World Machine Translation Output
    Pighin, Daniele
    Marquez, Lluis
    Formiga, Lluis
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 29 - 35
  • [39] HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
    Bojar, Ondrej
    Diatka, Vojtech
    Rychly, Pavel
    Stranak, Pavel
    Suchomel, Vit
    Tamchyna, Ales
    Zeman, Daniel
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3550 - 3555
  • [40] caWaC - A web corpus of Catalan and its application to language modeling and machine translation
    Ljubesic, Nikola
    Toral, Antonio
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 1728 - 1732