A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Cited by: 15
Authors
Bing, Xin [1]
Bunea, Florentina [1]
Wegkamp, Marten [1,2]
Affiliations
[1] Cornell Univ, Dept Stat & Data Sci, Ithaca, NY 14850 USA
[2] Cornell Univ, Dept Math, White Hall, Ithaca, NY 14853 USA
Keywords
adaptive estimation; anchor words; high dimensional estimation; identification; latent model; minimax estimation; nonnegative matrix factorization; overlapping clustering; separability; topic model;
DOI
10.3150/19-BEJ1166
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline Classification Codes
020208; 070103; 0714
Abstract
Topic models have become popular for the analysis of data consisting of a collection of n independent multinomial observations, with parameters N_i ∈ ℕ and Π_i ∈ [0,1]^p for i = 1, ..., n. The model links all cell probabilities, collected in a p × n matrix Π, via the assumption that Π can be factorized as the product of two nonnegative matrices A ∈ [0,1]^{p×K} and W ∈ [0,1]^{K×n}. Topic models were originally developed in text mining, where one browses through n documents, based on a dictionary of p words, covering K topics. In this terminology, the matrix A is called the word-topic matrix and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined under appropriate separability assumptions, discussed in detail in this work. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to the canonical basis vectors in ℝ^K. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, constructively utilize this assumption by linking the estimation of the set of anchor words with that of the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and it cannot easily be extended to the more realistic setting in which K is unknown. This work takes a different view on anchor word estimation, and on the estimation of A. We propose a new estimation method for topic models that is not a variation on the existing simplex-finding algorithms and that estimates K from the observed data. We derive new finite-sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any n, N_i, p and K, and both p and K are allowed to grow with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than existing ones, even though we start with the computational and theoretical disadvantage of not knowing the correct number of topics K, while the competing methods are provided with the correct value in our simulations.
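As a concrete illustration of the data-generating mechanism described in the abstract, the minimal sketch below simulates multinomial word counts from a factorization Π = AW whose word-topic matrix A contains anchor words. It is not the authors' estimation algorithm; the dimensions p, K, n, the common document length N, and the anchor-word weight are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
p, K, n, N = 100, 5, 50, 1000  # words, topics, documents, draws per document (all illustrative)

# Word-topic matrix A (p x K): each column is a probability vector over the p words.
A = rng.dirichlet(np.ones(p), size=K).T          # shape (p, K)

# Enforce the anchor word assumption: the first K rows of A are
# proportional to the canonical basis vectors of R^K.
A[:K, :] = 0.0
A[np.arange(K), np.arange(K)] = 0.05
A /= A.sum(axis=0, keepdims=True)                # renormalize columns to sum to 1

# Topic-document matrix W (K x n): each column is a probability vector over the K topics.
W = rng.dirichlet(np.ones(K), size=n).T          # shape (K, n)

# Cell-probability matrix Pi (p x n) and the observed word-count matrix X.
Pi = A @ W
X = np.column_stack([rng.multinomial(N, Pi[:, i]) for i in range(n)])
print(X.shape)                                   # (p, n)

Recovering A, the anchor words, and the number of topics K from the count matrix X alone is the estimation problem studied in the paper.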
Pages: 1765-1796
Number of pages: 32