Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

Cited by: 8
Authors
Magnusson, Mans [1 ]
Jonsson, Leif [1 ,2 ]
Villani, Mattias [1 ]
Broman, David [3 ]
Affiliations
[1] Linkoping Univ, Dept Comp & Informat Sci, S-58183 Linkoping, Sweden
[2] Ericsson AB, Stockholm, Sweden
[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Stockholm, Sweden
Keywords
Bayesian inference; Computational complexity; Gibbs sampling; Latent Dirichlet allocation; Massive datasets; Parallel computing
DOI
10.1080/10618600.2017.1366913
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Topic models, and more specifically the class of latent Dirichlet allocation (LDA), are widely used for probabilistic modeling of text. Markov chain Monte Carlo (MCMC) sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler. Supplementary materials for this article are available online.
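The partially collapsed strategy described in the abstract can be sketched on a toy corpus: the document-topic proportions are integrated out, while the topic-word matrix Phi is kept as an explicit parameter and resampled from its Dirichlet full conditional. Given Phi, documents are conditionally independent, which is what makes the topic-indicator step parallelizable. This is a minimal illustrative sketch, not the paper's implementation; the corpus, hyperparameter values, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids (hypothetical data).
docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4, 2]]
V, K = 5, 2              # vocabulary size, number of topics
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (illustrative)

# Initialize topic indicators and sufficient-statistic count matrices.
z = [rng.integers(K, size=len(doc)) for doc in docs]
ndk = np.zeros((len(docs), K))  # document-topic counts
nkw = np.zeros((K, V))          # topic-word counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1
        nkw[z[d][i], w] += 1

# Explicit (uncollapsed) topic-word matrix Phi, one Dirichlet row per topic.
phi = rng.dirichlet(np.full(V, beta), size=K)

for it in range(100):
    # Step 1: sample topic indicators z | Phi. With Phi fixed, documents
    # are conditionally independent, so this loop can run in parallel
    # across documents (shown serially here for clarity).
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            ndk[d, k_old] -= 1
            nkw[k_old, w] -= 1
            # Theta is collapsed (ndk + alpha term); Phi enters explicitly.
            p = (ndk[d] + alpha) * phi[:, w]
            k_new = rng.choice(K, p=p / p.sum())
            z[d][i] = k_new
            ndk[d, k_new] += 1
            nkw[k_new, w] += 1
    # Step 2: sample Phi | z from its Dirichlet full conditional.
    phi = np.array([rng.dirichlet(beta + nkw[k]) for k in range(K)])
```

The paper's samplers additionally exploit sparsity in the count matrices when drawing each indicator; the dense computation of `p` above is the plain version of that step.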
Pages: 449-463 (15 pages)