Statistical machine translation into a morphologically complex language

Cited by: 0
Authors
Oflazer, Kemal [1]
Affiliations
[1] Sabanci Univ, Fac Engn & Nat Sci, TR-34956 Istanbul, Tuzla, Turkey
Source
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING | 2008 / Vol. 4919
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present the results of our investigation into phrase-based statistical machine translation from English into Turkish - an agglutinative language with very productive inflectional and derivational word-formation processes. We investigate different representational granularities for morphological structure and find that (i) representing both Turkish and English at the morpheme level but with some selective morpheme grouping on the Turkish side of the training data, (ii) augmenting the training data with "sentences" comprising only the content words of the original training data to bias root word alignment, and with highly reliable phrase pairs from an earlier corpus alignment, (iii) re-ranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) "repairing" translated words with incorrect morphological structure and words which are out-of-vocabulary relative to the training and the language model corpus, provide a non-trivial improvement over a word-based baseline despite our very limited training data. We improve from 19.77 BLEU points for our word-based baseline model to 26.87 BLEU points, an improvement of 7.10 points or about 36% relative. We briefly discuss the applicability of BLEU to morphologically complex languages like Turkish and present a simple extension that compares tokens not in an all-or-none fashion but by taking lexical-semantic and morpho-semantic similarities into account, implemented in our BLEU+ tool.
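The improvement figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two BLEU scores are taken from the abstract, while the variable names are hypothetical and not part of the paper or of the BLEU+ tool.

```python
# BLEU scores as reported in the abstract.
baseline_bleu = 19.77   # word-based baseline
improved_bleu = 26.87   # best morpheme-based system

# Absolute gain in BLEU points and gain relative to the baseline.
absolute_gain = improved_bleu - baseline_bleu
relative_gain = absolute_gain / baseline_bleu

print(f"absolute gain: {absolute_gain:.2f} BLEU points")
print(f"relative gain: {relative_gain:.1%}")
```

The relative gain comes out slightly under 36%, which matches the abstract's "about 36% relative".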
Pages: 376-387
Number of pages: 12