Topic modeling algorithms and applications: A survey

Cited by: 123
Authors
Abdelrazek, Aly [1]
Eid, Yomna [1]
Gawish, Eman [1]
Medhat, Walaa [1,2]
Hassan, Ahmed [1]
Affiliations
[1] Nile Univ, Informat Technol & Comp Sci, CIS, Giza, Egypt
[2] Benha Univ, Fac Comp & Artificial Intelligence, Banha, Egypt
Keywords
Topic modeling; Neural; Probabilistic; Evaluation; LDA; Representation
DOI
10.1016/j.is.2022.102131
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Topic modeling is used in information retrieval to infer the hidden themes in a collection of documents, and thus provides an automatic means to organize, understand, and summarize large collections of textual information. Topic models also offer an interpretable representation of documents that is used in several downstream Natural Language Processing (NLP) tasks. Modeling techniques range from probabilistic graphical models to the more recent neural models. This paper surveys topic models from four aspects. The first aspect classifies topic modeling techniques into four categories: algebraic, fuzzy, probabilistic, and neural. We review the wide variety of available models in each category, highlight differences and similarities between models and model categories from a unified perspective, investigate the models' characteristics and limitations, and discuss their proper use cases. The second aspect presents six criteria for the proper evaluation of topic models, covering modeling quality, interpretability, stability, efficiency, and beyond. The third aspect examines applications: owing to its interpretability, topic modeling has found use in various disciplines, and we review these applications along with popular software tools that implement some of the models. The fourth aspect reviews available datasets and benchmarks. Using two benchmark datasets, we conducted experiments comparing seven topic models along the proposed metrics. The discussion highlights the differences between the models and their relative suitability for various applications, notes the relationships among the evaluation metrics, and proposes four key aspects to help decide which model to use for a given application. It also shows that research trends are moving toward developing and tuning neural topic models and leveraging the power of pre-trained language models. Finally, it highlights research gaps in developing unified benchmarks and evaluation metrics. (c) 2022 Elsevier Ltd. All rights reserved.
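To make the probabilistic category concrete, the following minimal sketch fits Latent Dirichlet Allocation (LDA), the model named in the keywords, using scikit-learn. The toy corpus, topic count, and parameter values are illustrative assumptions, not the survey's experimental setup.

# A minimal sketch of probabilistic topic modeling with LDA via scikit-learn.
# The corpus and all parameters below are illustrative assumptions only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the stock market fell as investors sold bank shares",
    "the team won the match with a late goal",
    "central banks raised interest rates to curb inflation",
    "the striker scored twice in the cup final",
]

# Build the bag-of-words document-term matrix, LDA's standard input.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

# Fit a two-topic model; n_components sets the number of latent themes.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # rows: per-document topic proportions

# Inspect each topic through its highest-weight words, the usual way
# topics are read and judged for interpretability.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])

The rows of doc_topics are per-document topic mixtures, i.e., the kind of interpretable document representation the abstract refers to for downstream NLP tasks.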
Pages: 17