Scalable Training of Hierarchical Topic Models

Cited by: 11
Authors
Chen, Jianfei [1]
Zhu, Jun [1]
Lu, Jie [2]
Liu, Shixia [2]
Affiliations
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
Proceedings of the VLDB Endowment | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation;
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures such as trees and concurrent dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, which is a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTMs as well as distributed data structures including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation for hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from the corpus with 50 machines in just 7 hours.
Pages: 826-839
Number of pages: 14