Improved Topic Modeling in Twitter Through Community Pooling

Cited by: 5
Authors
Albanese, Federico [1, 2]
Feuerstein, Esteban [1, 3]
Affiliations
[1] Univ Buenos Aires, Inst Ciencias Comp, CONICET, Buenos Aires, DF, Argentina
[2] Univ Buenos Aires, Inst Calculo, CONICET, Buenos Aires, DF, Argentina
[3] Univ Buenos Aires, Dept Comp, Buenos Aires, DF, Argentina
Source
STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2021 | 2021 / Vol. 12944
Keywords
Topic modelling; Community detection; Twitter; Text mining; Text clustering; TIME
DOI
10.1007/978-3-030-86692-1_17
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Social networks play a fundamental role in the propagation of information and news. Characterizing the content of the messages becomes vital for different tasks, such as breaking news detection, personalized message recommendation, fake user detection, information flow characterization and others. However, Twitter posts are short and often less coherent than other text documents, which makes it challenging to apply text mining algorithms to these datasets efficiently. Tweet pooling (aggregating tweets into longer documents) has been shown to improve automatic topic decomposition, but the performance achieved in this task varies depending on the pooling method. In this paper, we propose a new pooling scheme for topic modelling in Twitter, which groups tweets whose authors belong to the same community (a group of users who mainly interact with each other but not with other groups) on a user interaction graph. We present a complete evaluation of this methodology, state-of-the-art schemes and previous pooling models in terms of cluster quality, document retrieval task performance and supervised machine learning classification score. Results show that our Community pooling method outperformed other methods on the majority of metrics in two heterogeneous datasets, while also reducing the running time. This is useful when dealing with large amounts of noisy and short user-generated social media texts. Overall, our findings contribute to an improved methodology for identifying the latent topics in a Twitter dataset, without the need to modify the basic machinery of a topic decomposition model.
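The pooling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy edges and the toy tweets are hypothetical, and connected components are used here only as a simple stand-in for whichever community detection algorithm is run on the user interaction graph. Authors that do not appear in the graph are given their own pool, which is also an assumption.

```python
from collections import defaultdict


def connected_components(edges):
    """Stand-in for community detection (assumption): users in the same
    connected component of the interaction graph share a community label."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    community = {}
    next_label = 0
    for start in sorted(neighbors):
        if start in community:
            continue
        # Depth-first traversal labels every user reachable from `start`.
        community[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in neighbors[node]:
                if nb not in community:
                    community[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return community


def community_pool(tweets, community):
    """Concatenate tweets whose authors share a community into one document."""
    pools = defaultdict(list)
    for author, text in tweets:
        # Assumption: an author missing from the graph gets a singleton pool.
        key = community.get(author, ("unassigned", author))
        pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}


# Hypothetical toy data: interaction edges and (author, tweet) pairs.
edges = [("ana", "ben"), ("ben", "cat"), ("dan", "eve")]
tweets = [
    ("ana", "rates rise again"),
    ("cat", "central bank meets"),
    ("dan", "derby tonight"),
    ("eve", "what a goal"),
    ("zed", "hello world"),
]

community = connected_components(edges)
documents = community_pool(tweets, community)
for key, doc in documents.items():
    print(key, "->", doc)
```

Each pooled document could then be passed unchanged to a standard topic model, which matches the abstract's point that the topic decomposition machinery itself does not need to be modified.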
Pages: 209-216
Page count: 8