A Wikipedia-based Corpus for Contextualized Machine Translation

Authors
Drexler, Jennifer [1]
Rastogi, Pushpendre [2]
Aguilar, Jacqueline [3]
Van Durme, Benjamin [3]
Post, Matt [3]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
Machine Translation; Domain Adaptation; Corpus
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Subject classification codes
030303; 0501; 050102
Abstract
We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. This corpus comprises (i) a set of curated English Wikipedia articles describing news events along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality when including language models built over these English documents and interpolated with other, separately-derived, more general language model sources. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.
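The interpolation step mentioned in the abstract can be illustrated with a toy sketch. The abstract does not state the interpolation method, the n-gram order, or the weights, so everything below is an assumption made for illustration: add-one-smoothed unigram models, simple linear interpolation, and an arbitrary weight `lam = 0.5`. The function names and the toy sentences are hypothetical and are not taken from the corpus or the paper.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary (illustrative only)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_context, p_general, lam):
    """Linear interpolation: lam * p_context(w) + (1 - lam) * p_general(w)."""
    return {w: lam * p_context[w] + (1 - lam) * p_general[w] for w in p_context}


def perplexity(model, tokens):
    """Per-token perplexity of a unigram model on a token sequence."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Hypothetical toy data: a "general" source and a "comparable document" source.
general = "the market rose and the market fell".split()
context = "the earthquake struck the coast".split()
test = "the earthquake struck".split()
vocab = set(general) | set(context) | set(test)

p_general = train_unigram(general, vocab)
p_context = train_unigram(context, vocab)
# lam = 0.5 is an arbitrary assumption; in practice the weight would be tuned.
p_mixed = interpolate(p_context, p_general, lam=0.5)

ppl_general = perplexity(p_general, test)
ppl_mixed = perplexity(p_mixed, test)
```

On this toy data the interpolated model assigns the test sentence a lower perplexity than the general model alone, which is the intuition behind adding a language model built from comparable target-language documents. It says nothing about the size of the BLEU improvement reported in the paper.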
Pages: 3593-3596
Number of pages: 4