Latent Dirichlet allocation

Cited by: 24707
Authors
Blei, DM [1]
Ng, AY [2]
Jordan, MI [3,4]
Affiliations
[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[3] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
[4] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
DOI
10.1162/jmlr.2003.3.4-5.993
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
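For readers of this record, a minimal sketch of the three-level generative process the abstract describes, written in the notation conventionally used for LDA (k topics, Dirichlet parameter alpha, topic-word parameters beta); the symbol choices here are an illustrative assumption, not quoted from the record itself.

% Generative process for one document w = (w_1, ..., w_N) under k topics,
% assuming Dirichlet prior alpha and topic-word parameters beta.
\begin{align*}
  \theta &\sim \mathrm{Dirichlet}(\alpha)
    && \text{document-level topic proportions}\\
  z_n \mid \theta &\sim \mathrm{Multinomial}(\theta)
    && \text{topic assignment for word } n\\
  w_n \mid z_n, \beta &\sim \mathrm{Multinomial}(\beta_{z_n})
    && \text{observed word } n\\
  p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
    &= p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
    && \text{per-document joint distribution}\\
  q(\theta, \mathbf{z} \mid \gamma, \phi)
    &= q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n)
    && \text{mean-field variational family}
\end{align*}

In the variational approach the abstract mentions, the per-document parameters gamma and phi are fit to bound the log likelihood, and the corpus-level parameters alpha and beta are then updated in an outer empirical Bayes (variational EM) loop.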
Pages: 993-1022
Page count: 30