Scalable Training of Hierarchical Topic Models

Cited by: 11
Authors
Chen, Jianfei [1 ]
Zhu, Jun [1 ]
Lu, Jie [2 ]
Liu, Shixia [2 ]
Affiliations
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation;
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures such as trees and concurrent dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, which is a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTMs as well as distributed data structures including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from the corpus with 50 machines in just 7 hours.
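The abstract's "dynamically growing tree" refers to the nested Chinese Restaurant Process (nCRP) prior underlying hLDA: each document draws a root-to-leaf path, choosing at every level between an existing child (with probability proportional to how many documents already pass through it) and a fresh child (with probability proportional to a concentration parameter gamma). The sketch below illustrates only this prior-driven path sampling, not the paper's partially collapsed Gibbs sampler, which additionally conditions on word likelihoods; the `Node` class and parameter names are hypothetical, not from the paper's implementation.

```python
import random

class Node:
    """A node in the nCRP tree (hypothetical minimal structure):
    tracks its depth, its children, and how many document paths
    pass through it."""
    def __init__(self, depth):
        self.depth = depth
        self.count = 0          # documents whose path passes through here
        self.children = []

def sample_path(root, max_depth, gamma, rng=random):
    """Draw a root-to-leaf path of length max_depth via the nested CRP:
    at each level, pick an existing child with probability proportional
    to its count, or open a new child with probability proportional to
    gamma. The tree grows dynamically as new children are created."""
    path = [root]
    node = root
    node.count += 1
    for depth in range(1, max_depth):
        weights = [c.count for c in node.children] + [gamma]
        r = rng.uniform(0, sum(weights))
        acc, chosen = 0.0, None
        for child, w in zip(node.children, weights):
            acc += w
            if r < acc:
                chosen = child
                break
        if chosen is None:      # fell into the gamma slot: new child topic
            chosen = Node(depth)
            node.children.append(chosen)
        chosen.count += 1
        path.append(chosen)
        node = chosen
    return path

# Usage: sampling paths for a stream of documents grows the topic tree.
rng = random.Random(0)
root = Node(0)
paths = [sample_path(root, max_depth=3, gamma=1.0, rng=rng) for _ in range(10)]
```

Because counts accumulate as documents arrive, popular subtrees attract more paths (the "rich get richer" property), while gamma controls how readily new topics are opened; the paper's systems contribution is making this concurrently growing structure efficient at scale.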
Pages: 826-839
Page count: 14