Improving biterm topic model with word embeddings

Cited by: 18
Authors
Huang, Jiajia [1 ]
Peng, Min [2 ]
Li, Pengwei [1 ]
Hu, Zhiwei [3 ]
Xu, Chao [1 ]
Affiliations
[1] Nanjing Audit Univ, Nanjing 211815, Peoples R China
[2] Wuhan Univ, Wuhan 430072, Peoples R China
[3] Shanxi Agr Univ, Datong 030801, Peoples R China
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2020, Vol. 23, No. 6
Funding
National Natural Science Foundation of China
Keywords
Topic model; Word embeddings; Short texts; Noise biterm; BTM; Extraction
DOI
10.1007/s11280-020-00823-w
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation and other text analysis tasks. Conventional topic models mainly rely on word co-occurrence information in texts for topic inference. However, when these models are applied to short texts, it is usually hard to extract groups of words that are semantically coherent and have adequate representation ability, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuous development of word embeddings brings new representations of words and a more effective measure of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge derived from the frequency and semantic similarity of biterms. NBTMWE shows the following advantages over BTM: (1) it can distinguish meaningful latent topics from a noise topic consisting of commonly used words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Polya urn (GPU) model. Using auxiliary word embeddings trained from a large-scale corpus, we report test results on two short text datasets (i.e., Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of coherence, topic word similarity and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
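The biterm mining and prior-scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the corpus, tokenization, and the 2-D embedding vectors are made-up stand-ins for embeddings trained on a large external corpus.

```python
from itertools import combinations
from collections import Counter
import math

def extract_biterms(doc_tokens):
    """All unordered word pairs (biterms) co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(set(doc_tokens), 2)]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy short-text corpus and toy 2-D "embeddings" (hypothetical values).
corpus = [["topic", "model", "text"], ["topic", "model", "word"]]
embeddings = {
    "topic": [1.0, 0.0],
    "model": [0.9, 0.1],
    "text":  [0.0, 1.0],
    "word":  [0.2, 0.8],
}

# Biterm frequency over the whole corpus.
biterm_freq = Counter(b for doc in corpus for b in extract_biterms(doc))

# Per-biterm prior knowledge: (frequency, embedding similarity of its two words).
priors = {b: (f, cosine(embeddings[b[0]], embeddings[b[1]]))
          for b, f in biterm_freq.items()}
```

A biterm such as ("model", "topic") ends up with both high frequency and high similarity, so under the model's intuition it is unlikely to be assigned to the noise topic, whereas a rare, low-similarity biterm is a noise candidate.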
Pages: 3099-3124
Number of pages: 26