Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data

Cited by: 0
Authors
Christoph Weisser
Christoph Gerloff
Anton Thielmann
Andre Python
Arik Reuter
Thomas Kneib
Benjamin Säfken
Affiliations
[1] Georg-August-Universität Göttingen
[2] Campus-Institut Data Science (CIDAS)
[3] Zhejiang University
[4] Clausthal University of Technology
Source
Computational Statistics | 2023 / Volume 38
Keywords
Topic models; Collapsed Gibbs sampler algorithm for the Dirichlet multinomial model; Gamma-Poisson mixture topic model; Latent Dirichlet allocation; Model evaluation; Pseudo-document simulation; Covid-19; Social media; Twitter;
DOI
Not available
Abstract
Topic models are a useful and popular method for finding latent topics in documents. However, the short and sparse texts of social media micro-blogs such as Twitter are challenging for the most commonly used topic model, Latent Dirichlet Allocation (LDA). We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), both of which are specifically designed for sparse data. To compare the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that the standard coherence scores often used to evaluate topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
Pages: 647–674
Page count: 27