Using Topic Models in Domain Adaptation

Cited by: 0
Authors
Zahabi, Samira Tofighi [1]
Bakhshaei, Somayeh [1]
Khadivi, Shahram [1]
Affiliations
[1] Amirkabir Univ Technol, HLT Lab, Tehran, Iran
Source
2014 7th International Symposium on Telecommunications (IST) | 2014
Keywords
Translation Model; Natural Language Processing; Topic Model; Domain Adaptation
DOI
Not available
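Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809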
Abstract
An important property of a corpus is its domain. Usually, the quality of a statistical machine translation (SMT) system trained on an in-domain corpus increases when out-of-domain sentences are added to its training corpus. In this paper we show that out-of-domain corpora may also contain sentences that are suitable for improving the quality of the in-domain corpus. These sentences contain words and phrases that occur in in-domain corpora, so their context is more similar to that of the in-domain parallel corpus and far from that of the out-of-domain parallel corpora. We propose a method based on topic models to extract from out-of-domain parallel corpora those sentences whose context is similar to the in-domain parallel corpus, and we use the extracted sentences to train an SMT system. Finally, we show that the BLEU score of the system output increases by about 4.69% when this extra information is added to the training corpus.
Pages: 539-543
Page count: 5