Asynchronous distributed estimation of topic models for document analysis

Cited: 8
Authors
Asuncion, Arthur U. [1]
Smyth, Padhraic [1]
Welling, Max [1]
Affiliations
[1] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92717 USA
Keywords
Topic model; Distributed learning; Parallelization; Gibbs sampling
DOI
10.1016/j.stamet.2010.03.002
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline codes
020208; 070103; 0714
Abstract
Given the prevalence of large data sets and the availability of inexpensive parallel computing hardware, there is significant motivation to explore distributed implementations of statistical learning algorithms. In this paper, we present a distributed learning framework for Latent Dirichlet Allocation (LDA), a well-known Bayesian latent variable model for sparse matrices of count data. In the proposed approach, data are distributed across P processors, and processors independently perform inference on their local data and communicate their sufficient statistics in a local asynchronous manner with other processors. We apply two different approximate inference techniques for LDA, collapsed Gibbs sampling and collapsed variational inference, within a distributed framework. The results show significant improvements in computation time and memory when running the algorithms on very large text corpora using parallel hardware. Despite the approximate nature of the proposed approach, simulations suggest that asynchronous distributed algorithms are able to learn models that are nearly as accurate as those learned by the standard non-distributed approaches. We also find that our distributed algorithms converge rapidly to good solutions. (C) 2010 Elsevier B.V. All rights reserved.
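The scheme the abstract describes (each processor runs collapsed Gibbs sampling on its local documents against a possibly stale copy of the global topic-word counts, then communicates count updates to the others) can be sketched loosely in a single-machine simulation. This is our own simplification, not the paper's algorithm: the asynchronous pairwise exchange of sufficient statistics is replaced here by a serial round-robin delta merge, and all names (`lda_gibbs_distributed`, `loc_nkw`, and so on) are illustrative.

```python
import numpy as np

def lda_gibbs_distributed(docs, V, K=2, P=2, alpha=0.1, beta=0.01,
                          n_sweeps=20, seed=0):
    """Single-machine simulation of distributed collapsed Gibbs for LDA.

    Each of P simulated processors holds a shard of documents plus a
    local (stale) copy of the global topic-word counts; after a local
    sweep it merges its count deltas back into the global counts.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    # Random initial topic assignment for every token.
    z = [rng.integers(0, K, size=len(d)) for d in docs]
    ndk = np.zeros((D, K), dtype=int)   # document-topic counts
    nkw = np.zeros((K, V), dtype=int)   # global topic-word counts
    nk = np.zeros(K, dtype=int)         # global per-topic totals
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    shards = [list(range(p, D, P)) for p in range(P)]  # round-robin split
    for _ in range(n_sweeps):
        for shard in shards:  # one simulated processor at a time
            start_nkw, start_nk = nkw.copy(), nk.copy()     # snapshot
            loc_nkw, loc_nk = start_nkw.copy(), start_nk.copy()
            for d in shard:
                for i, w in enumerate(docs[d]):
                    k = z[d][i]  # remove the token's current assignment
                    ndk[d, k] -= 1; loc_nkw[k, w] -= 1; loc_nk[k] -= 1
                    # Collapsed Gibbs conditional p(z = k | rest),
                    # evaluated with the processor's local counts.
                    p_k = (ndk[d] + alpha) * (loc_nkw[:, w] + beta) \
                        / (loc_nk + V * beta)
                    k = rng.choice(K, p=p_k / p_k.sum())
                    z[d][i] = k
                    ndk[d, k] += 1; loc_nkw[k, w] += 1; loc_nk[k] += 1
            # Merge this processor's count deltas into the global state.
            nkw += loc_nkw - start_nkw
            nk += loc_nk - start_nk
    # Posterior-mean topic-word distributions.
    return (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
```

On a toy corpus whose documents use two disjoint word groups, the returned matrix gives one row of word probabilities per topic; in the paper's setting the merge step would instead happen asynchronously between processor pairs, which is what makes the inference approximate.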
Pages: 3-17
Page count: 15
Related papers
50 in total
  • [1] Generation of Word Clouds Using Document Topic Models
    Sendhilkumar, S.
    Srivani, M.
    Mahalakshmi, G. S.
    2017 SECOND INTERNATIONAL CONFERENCE ON RECENT TRENDS AND CHALLENGES IN COMPUTATIONAL MODELS (ICRTCCM), 2017, : 306 - 308
  • [2] Modeling query-document dependencies with topic language models for information retrieval
    Wu, Meng-Sung
    INFORMATION SCIENCES, 2015, 312 : 1 - 12
  • [3] Bayesian Analysis of Dynamic Linear Topic Models
    Glynn, Chris
    Tokdar, Surya T.
    Howard, Brian
    Banks, David L.
    BAYESIAN ANALYSIS, 2019, 14 (01): : 53 - 80
  • [4] A Phrase Topic Model Based on Distributed Representation
    Ma, Jialin
    Cheng, Jieyi
    Zhang, Lin
    Zhou, Lei
    Chen, Bolun
    CMC-COMPUTERS MATERIALS & CONTINUA, 2020, 64 (01): : 455 - 469
  • [5] Topic Mining over Asynchronous Text Sequences
    Wang, Xiang
    Jin, Xiaoming
    Chen, Meng-En
    Zhang, Kai
    Shen, Dou
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2012, 24 (01) : 156 - 169
  • [6] Noise Document Detection for Document Retrieval Based on Topic Match
    Noh, Yunseok
    Park, Seong-Bae
    ADVANCED SCIENCE LETTERS, 2017, 23 (10) : 9478 - 9481
  • [7] Topic Models with Topic Ordering Regularities for Topic Segmentation
    Du, Lan
    Pate, John K.
    Johnson, Mark
    2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2014, : 803 - 808
  • [8] Social-Network Analysis Using Topic Models
    Cha, Youngchul
    Cho, Junghoo
    SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2012, : 565 - 574
  • [9] Incorporating Popularity in Topic Models for Social Network Analysis
    Cha, Youngchul
    Bi, Bin
    Hsieh, Chu-Cheng
    Cho, Junghoo
    SIGIR'13: THE PROCEEDINGS OF THE 36TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH & DEVELOPMENT IN INFORMATION RETRIEVAL, 2013, : 223 - 232