Distributed Latent Dirichlet Allocation on Streams

Cited by: 1
Authors
Guo, Yunyan [1 ]
Li, Jianzhong [1 ,2 ]
Affiliations
[1] Harbin Inst Technol, 92 Xidazhi St, Harbin 150001, Heilongjiang, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed streams; learning system; variational inference; optimization; bursty
DOI
10.1145/3451528
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Latent Dirichlet Allocation (LDA) has been widely used for topic modeling, with applications spanning areas such as natural language processing and information retrieval. While LDA on small, static datasets has been studied extensively, practical scenarios pose additional challenges: real-world datasets are often huge and arrive in a streaming fashion. As the state-of-the-art LDA algorithm on streams, Streaming Variational Bayes (SVB) introduced Bayesian updating to provide a streaming procedure. However, the utility of SVB is limited in practice because it ignores three challenges of processing real-world streams: topic evolution, data turbulence, and real-time inference. In this article, we propose a novel distributed LDA algorithm, referred to as StreamFed-LDA, to address these challenges. For topic modeling of streaming data, the ability to capture evolving topics is essential for practical online inference; to achieve this, StreamFed-LDA builds on a specialized framework that supports lifelong (continual) learning of evolving topics. Data turbulence, in turn, is common in streams due to real-life events; the design of StreamFed-LDA allows the model to learn new characteristics from the most recent data while retaining historical information. Finally, on massive streaming data it is both difficult and crucial to provide real-time inference results; to increase throughput and reduce latency, StreamFed-LDA introduces techniques that substantially reduce both computation and communication costs in distributed systems. Experiments on four real-world datasets show that the proposed framework achieves significantly better online-inference performance than the baselines, while reducing latency by orders of magnitude.
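The SVB-style Bayesian updating mentioned in the abstract treats each mini-batch's posterior over the topic-word Dirichlet parameters as the prior for the next batch. A minimal illustrative sketch in NumPy follows; the function name, the simplified per-document E-step, and all parameter choices are our assumptions for illustration, not the paper's StreamFed-LDA implementation.

```python
import numpy as np

def svb_update(lam, batch, n_topics, n_iter=20):
    """One streaming step: the current posterior `lam` (topic-word Dirichlet
    parameters, shape K x V) acts as the prior, and the mini-batch's expected
    topic-word counts are added to it."""
    counts = np.zeros_like(lam)
    for doc in batch:  # doc: array of word ids
        # Crude per-document E-step: soft-assign each word to topics.
        phi = np.ones((len(doc), n_topics)) / n_topics
        for _ in range(n_iter):
            theta = phi.sum(axis=0) + 1.0      # document-topic weights
            phi = theta * lam[:, doc].T        # word-topic responsibilities
            phi /= phi.sum(axis=1, keepdims=True)
        for i, w in enumerate(doc):
            counts[:, w] += phi[i]
    # Posterior of this batch becomes the prior for the next batch.
    return lam + counts

rng = np.random.default_rng(0)
K, V = 3, 50
lam = np.ones((K, V))  # symmetric Dirichlet prior over topic-word weights
# A toy stream: 4 mini-batches of 5 documents, 30 word tokens each.
stream = [[rng.integers(0, V, 30) for _ in range(5)] for _ in range(4)]
for batch in stream:
    lam = svb_update(lam, batch, K)
```

Because each word token contributes exactly one unit of expected count, the total mass of `lam` grows by the number of tokens processed, which is the sense in which historical information is retained across batches.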
Pages: 20