Collecting and Using Comparable Corpora for Statistical Machine Translation

Cited by: 0

Authors:

Skadina, Inguna

Aker, Ahmet

Mastropavlos, Nikos

Su, Fangzhong

Tufis, Dan

Verlic, Mateja

Vasiljevs, Andrejs

Babych, Bogdan

Clough, Paul

Gaizauskas, Robert

Glaros, Nikos

Paramita, Monica Lestari

Pinnis, Marcis

Affiliations:

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

comparable corpora; under-resourced languages; machine translation; Web

DOI:

Not available

Chinese Library Classification:

H0 [Linguistics];

Subject classification codes:

030303 ; 0501 ; 050102 ;

Abstract:

A lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project addresses this issue by researching methods for improving machine translation systems with comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow the additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquiring comparable corpora from the Web and other sources, for evaluating the comparability of the collected corpora, for multi-level alignment of comparable corpora, and for extracting lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of the collected corpora in domain-adapted machine translation and real-life applications.

Pages: 438-445

Page count: 8
