Topic Modeling in Embedding Spaces

Cited by: 390
Authors
Dieng, Adji B. [1]
Ruiz, Francisco J. R. [2]
Blei, David M. [1]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] DeepMind, London, England
Funding
European Union Horizon 2020;
Keywords
LANGUAGE;
DOI
10.1162/tacl_a_00325
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
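The word-level construction described in the abstract can be illustrated with a minimal sketch. It assumes that the categorical distribution for a topic is obtained by applying a softmax to the inner products between that topic's embedding and every word embedding. The function names, the toy dimensions, and the random embeddings are hypothetical and chosen only for illustration. The sketch does not cover the amortized variational inference step.

```python
import math
import random


def softmax(logits):
    """Turn natural parameters into a categorical distribution (numerically stable)."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def topic_word_distribution(word_embeddings, topic_embedding):
    """Distribution over the vocabulary for one topic.

    word_embeddings: list of V vectors, one per vocabulary word (hypothetical values).
    topic_embedding: one vector of the same length, representing the topic.
    The natural parameter for each word is the inner product of the two vectors.
    """
    logits = [
        sum(r * a for r, a in zip(word_vec, topic_embedding))
        for word_vec in word_embeddings
    ]
    return softmax(logits)


# Toy setup (assumption): 6 words and 3 embedding dimensions, random values.
random.seed(0)
V, L = 6, 3
word_embeddings = [[random.gauss(0, 1) for _ in range(L)] for _ in range(V)]
topic_embedding = [random.gauss(0, 1) for _ in range(L)]

beta_k = topic_word_distribution(word_embeddings, topic_embedding)
```

Under this sketch, words whose embeddings align with the topic embedding receive a higher probability, so a topic can be read as a point in the same space as the words.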
Pages: 439-453
Number of pages: 15