Improving English-Arabic statistical machine translation with morpho-syntactic and semantic word class

Cited by: 0
Authors
Khemakhem I.T. [1]
Jamoussi S. [1]
Hamadou A.B. [1]
Affiliations
[1] MIRACL Laboratory, University of Sfax
Keywords
Alignment; Morpho-syntactic word classes; Semantic word classes; SMT; Statistical machine translation
DOI
10.1504/IJISTA.2020.107225
Abstract
In this paper, we present a new method for extracting and integrating morpho-syntactic and semantic word classes in a statistical machine translation (SMT) context to improve the quality of English-Arabic translation. The method can be applied across different statistical machine translation systems and to languages with complicated morphological paradigms. We first identify morpho-syntactic word classes, which are used to build the statistical language model. We then apply a semantic word clustering algorithm to English. The resulting semantic word classes are projected from the English side to the featured Arabic side, using the word alignment produced in the alignment step by the GIZA++ tool. Finally, we apply a new process that incorporates the semantic classes in order to improve SMT quality. We show the efficacy of the method on both small and larger English-to-Arabic translation tasks. The experimental results show that introducing morpho-syntactic and semantic word classes yields a 7.7% relative improvement in BLEU score. © 2020 Inderscience Enterprises Ltd.
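The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the alignment has already been converted to `i-j` index pairs (source index, then target index), and the function names, class labels, and transliterated Arabic tokens are hypothetical. A target word aligned to several source words takes the most frequent class among them; unaligned or unclassed words receive a placeholder label.

```python
from collections import Counter


def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) into a list of tuples."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def project_classes(src_tokens, tgt_tokens, alignment, src_classes, unk="UNK"):
    """Project source-side word classes onto target tokens via alignment links.

    src_classes maps a source word to its class label (hypothetical labels).
    Each target token collects the classes of the source words it is aligned to
    and keeps the most frequent one; tokens with no classed link get `unk`.
    """
    votes = [Counter() for _ in tgt_tokens]
    for i, j in alignment:
        cls = src_classes.get(src_tokens[i])
        if cls is not None:
            votes[j][cls] += 1
    return [v.most_common(1)[0][0] if v else unk for v in votes]


# Toy example with placeholder tokens and an assumed alignment.
src = ["the", "boy", "reads", "a", "book"]
tgt = ["yaqra", "al+", "walad", "kitAbA"]
links = parse_alignment("0-1 1-2 2-0 3-3 4-3")
classes = {"boy": "C_PERSON", "reads": "C_ACTION", "book": "C_OBJECT"}

projected = project_classes(src, tgt, links, classes)
print(list(zip(tgt, projected)))
```

Majority voting is only one possible way to resolve many-to-one links; the abstract does not say how the paper handles such cases.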
Pages: 172-190
Page count: 18