Extracting parallel fragments from comparable documents using a generative model

Cited by: 1
Authors
Bakhshaei, Somayeh [1]
Safabakhsh, Reza [1]
Khadivi, Shahram [2, 3]
Affiliations
[1] Amirkabir Univ Technol, Comp Engn & Informat Technol Dept, Tehran, Iran
[2] eBay Inc, Aachen, Germany
[3] Amirkabir Univ Technol, Tehran, Iran
Keywords
Fragment extraction; Comparable corpora; Generative model; Statistical machine translation; Persian; English; German
DOI
10.1016/j.csl.2018.07.002
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Although parallel corpora are essential language resources for many natural language processing tasks, they are scarce or unavailable for many language pairs. Comparable corpora, by contrast, are widely available and contain parallel fragments of information that can be used in applications such as statistical machine translation systems. In this research, we propose a generative model based on latent Dirichlet allocation for extracting parallel fragments from comparable documents without using any initial parallel data or bilingual lexicon. The experimental results show a significant improvement when the fragments extracted by the proposed method are used to augment an existing parallel corpus in a statistical machine translation system. According to human judgment, the accuracy of the proposed method on an English-Persian task is about 59.7%. The out-of-vocabulary error rate for the same task is also reduced by 28%. © 2018 Elsevier Ltd. All rights reserved.
Pages: 25-42
Page count: 18