iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation

Cited by: 0
Authors
Wang, Longyue [1 ]
Wong, Derek F. [1 ]
Chao, Lidia S. [1 ]
Lu, Yi [1 ]
Xing, Junwen [1 ]
Affiliations
[1] Univ Macau, Dept Comp & Informat Sci, Nat Language Proc & Portuguese Chinese Machine Tr, Macau, Macau, Peoples R China
Source
CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA | 2013 / Vol. 8208
Keywords
Data Selection; Statistical Machine Translation; Domain Adaptation; Hybrid Model; Similarity Metrics;
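DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract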
Data selection is an important technique for enhancing data-driven models, especially in large-scale natural language processing (NLP). Recent research on domain adaptation for statistical machine translation (SMT) has focused on using various individual data selection models. In this paper, we propose a hybrid data selection model named iCPE, which combines three state-of-the-art similarity metrics, Cosine tf-idf, Perplexity and Edit distance, at both the corpus level and the model level. We conduct experiments on the Hong Kong Law Chinese-English corpus, and the results show that this simple and effective hybrid model outperforms both the baseline system trained on the entire data and the best rival method. The consistent performance gains of the proposed approach have profound implications for mining very large corpora in a computationally limited environment.
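As a rough illustration of the corpus-level combination mentioned in the abstract, the Python sketch below scores general-domain sentences with the three named metrics and keeps the union of the top-k sentences under each one. This record does not give the paper's formulas, so several things here are assumptions: the perplexity uses an add-one-smoothed unigram model rather than a full n-gram language model, taking the union of top-k selections is only one plausible reading of "corpus level", and all function names (`tfidf_vectors`, `cosine`, `edit_distance`, `perplexity`, `select`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf dict per tokenized document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def perplexity(sent, counts, total, vocab_size):
    """Perplexity under an add-one-smoothed unigram model (a simplification)."""
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in sent)
    return math.exp(-logp / max(len(sent), 1))


def select(in_domain, general, k):
    """Union of the top-k general-domain sentences under each metric (assumed scheme)."""
    in_tok = [s.split() for s in in_domain]
    gen_tok = [s.split() for s in general]
    pseudo = [t for s in in_tok for t in s]  # in-domain corpus as one pseudo-document

    # Cosine tf-idf: similarity of each candidate to the in-domain pseudo-document.
    vecs = tfidf_vectors(gen_tok + [pseudo])
    cos = [cosine(v, vecs[-1]) for v in vecs[:-1]]

    # Perplexity: lower is better, so negate it to rank in descending order.
    counts = Counter(pseudo)
    ppl = [-perplexity(s, counts, len(pseudo), len(counts) + 1) for s in gen_tok]

    # Edit distance: normalized similarity to the closest in-domain sentence.
    ed = [max(1 - edit_distance(s, d) / max(len(s), len(d), 1) for d in in_tok)
          for s in gen_tok]

    chosen = set()
    for scores in (cos, ppl, ed):
        ranked = sorted(range(len(general)), key=lambda i: scores[i], reverse=True)
        chosen.update(ranked[:k])
    return [general[i] for i in sorted(chosen)]
```

Calling `select(in_domain, general, k)` with two lists of whitespace-tokenized sentences returns the selected general-domain sentences in their original order. Because the result is a union, it contains between k and 3k sentences whenever the general corpus has at least k sentences, depending on how much the three rankings agree.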
Pages: 280-290
Page count: 11