Incorporating Word Embedding into Cross-lingual Topic Modeling

Cited by: 4
Authors
Chang, Chia-Hsuan [1]
Hwang, San-Yih [1]
Xui, Tou-Hsiang [1]
Affiliations
[1] National Sun Yat-sen University, Department of Information Management, Kaohsiung, Taiwan
Source
2018 IEEE INTERNATIONAL CONGRESS ON BIG DATA (IEEE BIGDATA CONGRESS) | 2018
Keywords
cross-lingual topic model; text mining; Latent Dirichlet Allocation; word space
DOI
10.1109/BigDataCongress.2018.00010
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we address cross-lingual topic modeling, an important technique that enables global enterprises to detect and compare topic trends across global markets. Previous work in cross-lingual topic modeling has proposed methods that use parallel or comparable corpora to construct a polylingual topic model. However, parallel or comparable corpora are in many cases not available. In this research, we combine techniques for mapping cross-lingual word spaces with topic modeling (LDA) and propose two methods: Translated Corpus with LDA (TC-LDA) and Post Match LDA (PM-LDA). The cross-lingual word space mapping allows us to compare words of different languages, and LDA enables us to group words into topics. Neither TC-LDA nor PM-LDA needs a parallel or comparable corpus, and hence both are applicable to more domains. The effectiveness of both methods is evaluated using UM-Corpus and WS-353. Our evaluation results indicate that both methods are able to identify similar documents written in different languages. In addition, PM-LDA is shown to achieve better performance than TC-LDA, especially when documents are short.
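The abstract does not spell out how topics are aligned across languages, so the following is only a minimal sketch of one plausible reading of the "post match" idea: topics are learned separately per language, then paired by comparing their top words in a shared word space. The function names (`centroid`, `cosine`, `match_topics`), the centroid-plus-cosine matching rule, and the toy two-dimensional vectors are all assumptions made for illustration. They are not taken from the paper, and the LDA step and the word space mapping itself are assumed to have been done already.

```python
import math

def centroid(words, vectors):
    """Average the vectors of the given words; words without a vector are skipped."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        raise ValueError("no word in the topic has a vector")
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_topics(src_topics, tgt_topics, vectors):
    """For each source topic, return the index of the most similar target topic."""
    tgt_centroids = [centroid(t, vectors) for t in tgt_topics]
    matches = []
    for topic in src_topics:
        c = centroid(topic, vectors)
        sims = [cosine(c, t) for t in tgt_centroids]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

# Hand-made toy vectors standing in for an already-mapped cross-lingual word space.
vectors = {
    "bank": [0.9, 0.1], "loan": [0.8, 0.2],
    "銀行": [0.85, 0.15], "貸款": [0.8, 0.1],
    "soccer": [0.1, 0.9], "goal": [0.2, 0.8],
    "足球": [0.15, 0.9], "進球": [0.1, 0.8],
}

# Top words of topics that would come from two separate monolingual LDA runs.
english_topics = [["bank", "loan"], ["soccer", "goal"]]
chinese_topics = [["足球", "進球"], ["銀行", "貸款"]]

matches = match_topics(english_topics, chinese_topics, vectors)
print(matches)
```

In this toy setup the English finance topic pairs with the second Chinese topic and the English sports topic with the first. A real pipeline would replace the hand-made vectors with embeddings mapped into a common space and would probably weight words by their topic probabilities rather than averaging them equally.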
Pages: 17-24
Number of pages: 8