A Topic Model Based on Poisson Decomposition

Cited by: 8
Authors
Jiang, Haixin [1 ,4 ]
Zhou, Rui [2 ]
Zhang, Limeng [1 ,4 ]
Wang, Hua [3 ]
Zhang, Yanchun [3 ,4 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing, Peoples R China
[2] Swinburne Univ Technol, Dept Comp Sci & Software Engn, Melbourne, Vic, Australia
[3] Victoria Univ, Ctr Appl Informat, Melbourne, Vic, Australia
[4] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
Source
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017
Funding
National Natural Science Foundation of China; Australian Research Council
Keywords
Topic model; Poisson decomposition; statistical testing; text classification; topic coherence;
DOI
10.1145/3132847.3132942
CLC number
TP [Automation and computer technology]
Subject classification
0812
Abstract
Determining appropriate statistical distributions for modeling text corpora is important for accurately estimating their numerical characteristics. Based on a statistical test of the claim that the data conform to a Poisson distribution, we propose the Poisson decomposition model (PDM), a statistical model for count data in text corpora that directly captures each document's multidimensional numerical characteristics on topics. In PDM, each topic is represented as a parameter vector of a multidimensional Poisson distribution, which can easily be normalized to multinomial term probabilities, and each document is represented by its measurements on topics, thereby reducing it to a measurement vector over topics. We use gradient descent and a sampling algorithm for parameter estimation. We carry out extensive experiments on the topics produced by our model. The results demonstrate that our approach extracts more coherent topics and, using PDM-based features, is competitive in document clustering compared to PLSI and LDA.
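The abstract's representation can be illustrated with a minimal sketch. The code below is a hypothetical toy construction (the dimensions, the gamma prior on rates, and the per-document measurement vector are all assumptions, not the paper's actual estimation procedure): each topic is a vector of Poisson rates over the vocabulary, normalizing each rate vector yields multinomial term probabilities, and a document's word counts arise as Poisson draws whose expectations are its topic measurements times the topic rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K topics over a vocabulary of V terms.
K, V = 3, 8

# Each topic is a vector of Poisson rates (one rate per term);
# gamma-distributed rates are just an illustrative choice here.
topic_rates = rng.gamma(shape=2.0, scale=1.0, size=(K, V))

# Normalizing each topic's rate vector gives multinomial term
# probabilities, as noted in the abstract.
term_probs = topic_rates / topic_rates.sum(axis=1, keepdims=True)

# A document is reduced to a measurement vector over topics; its
# expected term counts mix the topic rates by those measurements,
# and the observed counts are Poisson draws around that expectation.
doc_measurements = np.array([5.0, 1.0, 0.5])  # hypothetical values
expected_counts = doc_measurements @ topic_rates
doc_counts = rng.poisson(expected_counts)

print(term_probs.sum(axis=1))  # each row sums to 1
print(doc_counts)              # one nonnegative count per term
```

The normalization step is why PDM topics remain directly comparable to the multinomial topics of PLSI and LDA, while the unnormalized Poisson rates additionally carry per-document magnitude information.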
Pages: 1489-1498 (10 pages)