FastBTM: Reducing the sampling time for biterm topic model

Cited by: 18
Authors
He, Xingwei [1 ]
Xu, Hua [1 ]
Li, Jia [1 ]
He, Liu [2 ]
Yu, Linlin [3 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, State Key Lab Intelligent Technol & Syst, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China
[2] Beijing Univ Posts & Telecommun, Beijing 100876, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
BTM; Topic model; Alias method; Metropolis-Hastings; Acceleration algorithm; MARKOV-CHAINS; ALGORITHM;
DOI
10.1016/j.knosys.2017.06.005
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to the popularity of social networks such as microblogs and Twitter, a vast amount of short text data is created every day, making research on short text, such as topic inference, increasingly significant. The biterm topic model (BTM) exploits the word co-occurrence patterns of the whole corpus, which makes it perform better than conventional topic models at uncovering latent semantic relevance in short text. However, BTM relies on Gibbs sampling to infer topics, which is very time consuming, especially for large-scale datasets or when the number of topics is extremely large: it requires O(K) operations per sample, where K denotes the number of topics in the corpus. In this paper, we propose FastBTM, an acceleration algorithm that applies an efficient sampling method to BTM and converges much faster than BTM without degrading topic quality. FastBTM is based on Metropolis-Hastings and the alias method, both of which have been widely adopted in the Latent Dirichlet Allocation (LDA) model and have achieved outstanding speedups. FastBTM effectively reduces the sampling complexity of the biterm topic model from O(K) to O(1) amortized time. We carry out experiments on three datasets: two short text datasets, the Tweets2011 Collection and Yahoo! Answers, and one long document dataset, Enron. Our experimental results show that as the number of topics K increases, the speedup of FastBTM over BTM grows even larger. In addition, FastBTM is effective for both short text and long document datasets. (C) 2017 Elsevier B.V. All rights reserved.
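The O(1) amortized sampling cost claimed in the abstract rests on the alias method: after O(K) preprocessing, a sample from a K-way discrete distribution costs O(1), and the table is reused across many Metropolis-Hastings proposals before being rebuilt. The sketch below is not the authors' code; it is a generic Python illustration of Vose's alias construction that the paper builds on.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(K) preprocessing so each later draw is O(1).

    Each slot i keeps a threshold prob[i] and a fallback topic alias[i];
    together they encode the original distribution exactly.
    """
    K = len(probs)
    scaled = [p * K for p in probs]          # rescale so the mean is 1.0
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l     # slot s: keep s or fall back to l
        scaled[l] -= 1.0 - scaled[s]         # l donated mass to fill slot s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                  # leftovers are exactly 1.0
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """Draw one topic in O(1): pick a slot, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

In a Metropolis-Hastings sampler like the one the paper describes, draws from a slightly stale alias table serve as cheap proposals, and the accept/reject step corrects for the staleness, which is why the table only needs rebuilding periodically rather than per sample.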
Pages: 11-20
Page count: 10
References
46 records
[1]   An introduction to MCMC for machine learning [J].
Andrieu, C ;
de Freitas, N ;
Doucet, A ;
Jordan, MI .
MACHINE LEARNING, 2003, 50 (1-2) :5-43
[2]  
[Anonymous], INTELLIGENT SOFTWARE
[3]  
[Anonymous], 2002, TECHNICAL REPORT
[4]  
[Anonymous], 2013, P 22 INT C WORLD WID, DOI DOI 10.1145/2487788.2488002
[5]   A hybrid approach to the sentiment analysis problem at the sentence level [J].
Appel, Orestes ;
Chiclana, Francisco ;
Carter, Jenny ;
Fujita, Hamido .
KNOWLEDGE-BASED SYSTEMS, 2016, 108 :110-124
[6]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[7]  
Cambria, 2015, FLAIRS C, P311
[8]  
Chen J., 2016, VLDB, P744
[9]  
Chen WZ, 2015, PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, P489
[10]   BTM: Topic Modeling over Short Texts [J].
Cheng, Xueqi ;
Yan, Xiaohui ;
Lan, Yanyan ;
Guo, Jiafeng .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (12) :2928-2941