Improving biterm topic model with word embeddings

Cited by: 18
Authors
Huang, Jiajia [1 ]
Peng, Min [2 ]
Li, Pengwei [1 ]
Hu, Zhiwei [3 ]
Xu, Chao [1 ]
Affiliations
[1] Nanjing Audit Univ, Nanjing 211815, Peoples R China
[2] Wuhan Univ, Wuhan 430072, Peoples R China
[3] Shanxi Agr Univ, Datong 030801, Peoples R China
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2020, Vol. 23, No. 6
Funding
National Natural Science Foundation of China
Keywords
Topic model; Word embeddings; Short texts; Noise biterm; BTM; Extraction
DOI
10.1007/s11280-020-00823-w
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation and other text analysis tasks. Conventional topic models mainly rely on word co-occurrence information in texts for topic inference. However, when these models are applied to short texts, it is usually hard to extract groups of words that are semantically coherent and have adequate representation ability, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuous development of word embeddings brings new representations of words and a more effective measure of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge derived from the frequency and semantic similarity of biterms. NBTMWE shows the following advantages over BTM: (1) it can distinguish meaningful latent topics from a noise topic consisting of commonly used words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Polya urn (GPU) model. Using auxiliary word embeddings trained from a large-scale corpus, we report test results on two short text datasets (i.e., Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of coherence, topic word similarity and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
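The biterm mining and prior-scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the corpus, tokenization, and the 2-D embedding vectors are made-up stand-ins for embeddings trained on a large external corpus.

```python
from itertools import combinations
from collections import Counter
import math

def extract_biterms(doc_tokens):
    """All unordered word pairs (biterms) co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(set(doc_tokens), 2)]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy short-text corpus and toy 2-D "embeddings" (hypothetical values).
corpus = [["topic", "model", "text"], ["topic", "model", "word"]]
embeddings = {
    "topic": [1.0, 0.0],
    "model": [0.9, 0.1],
    "text":  [0.0, 1.0],
    "word":  [0.2, 0.8],
}

# Biterm frequency over the whole corpus.
biterm_freq = Counter(b for doc in corpus for b in extract_biterms(doc))

# Per-biterm prior knowledge: (frequency, embedding similarity of its two words).
priors = {b: (f, cosine(embeddings[b[0]], embeddings[b[1]]))
          for b, f in biterm_freq.items()}
```

A biterm such as ("model", "topic") ends up with both high frequency and high similarity, so under the model's intuition it is unlikely to be assigned to the noise topic, whereas a rare, low-similarity biterm is a noise candidate.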
Pages: 3099-3124
Number of pages: 26