Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA

Cited by: 166
Authors
Lu, Yue [1]
Mei, Qiaozhu [2]
Zhai, ChengXiang [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
Source
INFORMATION RETRIEVAL | 2011 / Vol. 14 / Issue 02
Keywords
Evaluation; Topic models; LDA; PLSA; Experimentation; Performance
DOI
10.1007/s10791-010-9141-9
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none has systematically investigated their task performance. As a result, some critical questions that may affect the performance of all applications of topic models remain mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set parameters in topic models. In this paper, we address these questions by conducting a systematic investigation of two representative probabilistic topic models, Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks: document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides a deeper understanding of topic models and many useful insights about how to optimize the performance of topic models for these typical tasks. The task-based evaluation framework is generalizable to other topic models in the family of either PLSA or LDA.
Pages: 178-203
Page count: 26