Combining IR and LDA Topic Modeling for Filtering Microblogs

Cited by: 35
Authors
Hajjem, Malek [1]
Latiri, Chiraz [1]
Affiliations
[1] Tunis EL Manar Univ, Fac Sci Tunis, LIPAH Res Lab, Campus Univ Farhat Hached,BP n94, Tunis 1068, Tunisia
Source
KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS | 2017 / Vol. 112
Keywords
Microblogs; LDA; Pruning irrelevant tweets; Information Retrieval; Aggregation;
DOI
10.1016/j.procs.2017.08.166
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Twitter is a micro-blogging and social networking service where users post millions of short messages every day. Building multilingual corpora from this microblog content can be useful for several computational tasks such as opinion mining. However, gathering Twitter data raises the problem of irrelevant data being included. Recent studies have shown that topic models such as Latent Dirichlet Allocation (LDA) are not consistent when applied to short texts like tweets. In order to prune irrelevant tweets, we investigate in this paper a novel method to improve the topics learned from Twitter content without modifying the basic machinery of LDA. The method is based on a pooling process that combines an information retrieval (IR) approach with LDA: an IR-based aggregation strategy retrieves similar tweets and groups them in the same cluster. The pooled tweets are then used as input to a basic LDA model to overcome the sparsity problem of Twitter content. Empirical results show that tweet aggregation based on IR and LDA improves a variety of topic coherence measures in comparison with an unmodified LDA baseline and a variety of pooling schemes. (C) 2017 The Authors. Published by Elsevier B.V.
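
The pooling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which retrieval model, similarity threshold, or clustering rule the authors use, so everything below is an assumption made for illustration: TF-IDF weighting, cosine similarity, a greedy grouping rule, and the hypothetical names `tfidf_vectors`, `cosine`, and `pool_tweets`. Each tweet acts as a query, the similar tweets it retrieves are merged into one pseudo-document, and those pseudo-documents would then be passed to an unmodified LDA implementation (not shown here).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization. A real pipeline
    # would also handle hashtags, mentions, URLs and stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document,
    # using a smoothed inverse document frequency.
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pool_tweets(tweets, threshold=0.2):
    # Greedy IR-style pooling (an illustrative rule, not the paper's):
    # each still-unassigned tweet is used as a query, and every other
    # unassigned tweet whose similarity reaches the threshold joins its pool.
    vectors = tfidf_vectors(tweets)
    assigned = set()
    pools = []
    for i in range(len(tweets)):
        if i in assigned:
            continue
        pool = [i]
        assigned.add(i)
        for j in range(i + 1, len(tweets)):
            if j not in assigned and cosine(vectors[i], vectors[j]) >= threshold:
                pool.append(j)
                assigned.add(j)
        pools.append(pool)
    return pools


def pooled_documents(tweets, pools):
    # Concatenate the tweets of each pool into one longer pseudo-document,
    # which is what a standard LDA model would receive as input.
    return [" ".join(tweets[i] for i in pool) for pool in pools]


if __name__ == "__main__":
    sample = [
        "lda topic model for tweets",
        "topic model lda on short tweets",
        "football match tonight goal",
        "great goal in the football match",
    ]
    for doc in pooled_documents(sample, pool_tweets(sample)):
        print(doc)
```

Pooling before topic modeling gives LDA longer documents with more word co-occurrences, which is the sparsity argument the abstract makes. The threshold value here is arbitrary and would need tuning on real data.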
Pages: 761-770
Number of pages: 10