Word Re-Segmentation in Chinese-Vietnamese Machine Translation

Cited by: 10
Authors
Phuoc Tran [1]
Dien Dinh [2]
Long H. B. Nguyen [2]
Affiliations
[1] Ton Duc Thang Univ, Fac Informat Technol, Ho Chi Minh City, Vietnam
[2] VNU Univ Sci, Fac Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Word boundary; word segmentation; character; word re-segmentation; Chinese-Vietnamese machine translation; isolated language
DOI
10.1145/2988237
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In isolated languages, such as Chinese and Vietnamese, words are not separated by spaces, and a word may be formed by one or more syllables. Therefore, word segmentation (WS) is usually the first step in the machine translation process. WS in the source and target languages is based on different training corpora, and the WS approaches may not be the same. As a result, the WS results in the two languages are often not homologous, and word alignment produces many 1-n and n-1 alignment pairs in statistical machine translation, which degrades the performance of machine translation. In this article, we adjust the WS for Chinese and Vietnamese in particular, and for isolated language pairs in general, to make the word boundaries of the two languages more symmetric, in order to strengthen 1-1 alignments and enhance machine translation performance. We tested this method on the Computational Linguistics Center's corpus, which consists of 35,623 sentence pairs. The experimental results show that our method significantly improves the performance of machine translation compared to the baseline translation system, the WS translation system, and anchor language-based WS translation systems.
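To make the 1-1 versus 1-n distinction in the abstract concrete, the following is a minimal sketch, not the authors' method. It assumes word alignments are given as (source index, target index) link pairs; the function name `alignment_types` and the example sentence (Chinese 我 是 学生, Vietnamese "tôi là học sinh") are illustrative choices, not taken from the paper.

```python
from collections import Counter


def alignment_types(links):
    """Count alignment links by fan-out type: '1-1', '1-n', 'n-1', or 'n-n'.

    `links` is a list of (source_index, target_index) pairs.
    This is an illustrative helper, not part of the paper.
    """
    src_degree = Counter(s for s, _ in links)  # links per source word
    tgt_degree = Counter(t for _, t in links)  # links per target word
    counts = Counter()
    for s, t in links:
        one_to_many = src_degree[s] > 1  # source word maps to several target words
        many_to_one = tgt_degree[t] > 1  # target word maps to several source words
        if one_to_many and many_to_one:
            counts["n-n"] += 1
        elif one_to_many:
            counts["1-n"] += 1
        elif many_to_one:
            counts["n-1"] += 1
        else:
            counts["1-1"] += 1
    return counts


# Source: 我 / 是 / 学生. Target split by syllable: tôi / là / học / sinh.
# 学生 links to both "học" and "sinh", giving two 1-n links.
syllable_links = [(0, 0), (1, 1), (2, 2), (2, 3)]

# Target re-segmented so "học_sinh" is one word: tôi / là / học_sinh.
resegmented_links = [(0, 0), (1, 1), (2, 2)]

print(alignment_types(syllable_links))
print(alignment_types(resegmented_links))
```

In this toy case, merging "học" and "sinh" into one target word turns the two 1-n links into a single 1-1 link, which is the kind of boundary symmetry the abstract aims for.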
Pages: 22