Embedding Semantic Anchors to Guide Topic Models on Short Text Corpora

Cited by: 3
Authors
Steuber, Florian [1]
Schneider, Sinclair [1]
Schoenfeld, Mirco [2]
Affiliations
[1] Univ Bundeswehr Munchen, Res Inst CODE, Neubiberg, Germany
[2] Univ Bayreuth, Bayreuth, Germany
Keywords
Topic modeling; Short text; Word embedding; Transfer learning; Big data
DOI
10.1016/j.bdr.2021.100293
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Documents on the social media platform Twitter are written in a short and simple style rather than at length and in detail. Further, the core message of a post is often encoded in characteristic phrases called hashtags. These hashtags illustrate the semantics of a post or tie it to a specific topic. In this paper, we propose multiple approaches that use hashtags and their surrounding texts to improve topic modeling of short texts. We use transfer learning by applying a pre-trained word embedding of hashtags to derive preliminary topics. These function as supervising information, or seed topics, and are passed to Archetypal LDA (A-LDA), a recent variant of Latent Dirichlet Allocation. We demonstrate the effectiveness of our approach using a large corpus of Twitter posts as an example. Our approaches improve the topic model's quality in terms of various quantitative metrics. Moreover, the presented algorithms for extracting seed topics can themselves be used as a form of lightweight topic model. Hence, our approaches create additional analytical opportunities and can help to gain a more detailed understanding of what people are talking about on social media. By using big data, in the form of millions of tweets, for preprocessing and fine-tuning, we enable the classification algorithm to produce topics that are very coherent to the reader. © 2021 Elsevier Inc. All rights reserved.
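The abstract does not say how seed topics are derived from the hashtag embedding. As a rough illustration only, the sketch below assumes a simple greedy grouping of hashtags by cosine similarity; this is not the authors' algorithm. The function name `seed_topics`, the `threshold` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_topics(embedding, threshold=0.8):
    """Greedy grouping (an assumption, not the paper's method): each hashtag
    joins the first group whose first member is similar enough, otherwise it
    starts a new group."""
    topics = []
    for tag, vec in embedding.items():
        for topic in topics:
            anchor_vec = embedding[topic[0]]
            if cosine(vec, anchor_vec) >= threshold:
                topic.append(tag)
                break
        else:
            topics.append([tag])
    return topics


# Toy stand-in for a pre-trained hashtag embedding (values are made up).
toy_embedding = {
    "#football": [0.9, 0.1, 0.0],
    "#soccer": [0.8, 0.2, 0.1],
    "#election": [0.0, 0.9, 0.3],
    "#vote": [0.1, 0.8, 0.4],
}

topics = seed_topics(toy_embedding)
for i, topic in enumerate(topics):
    print(i, topic)
```

In a pipeline like the one the abstract outlines, groups of this kind would serve as the seed topics handed to the guided topic model; that step is not shown here because the abstract gives no interface for A-LDA.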
Pages: 13