Optimal estimation of sparse topic models

Cited: 0
Authors
Bing, Xin [1 ]
Bunea, Florentina [1 ]
Wegkamp, Marten [2 ]
Affiliations
[1] Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, United States
[2] Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, United States
Funding
National Science Foundation (USA)
Keywords
Matrix algebra; Sampling
DOI
Not available
Abstract
Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consist of observed frequencies of a vocabulary of p words in n documents, stored in a p × n matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a p × K word-topic matrix A and a K × n topic-document matrix W. This paper studies the estimation of a possibly element-wise sparse A when the number of topics K is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such A and propose a new computationally efficient algorithm for its recovery. We derive a finite-sample upper bound for our estimator and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of A, and our analysis is valid for any finite n, p, K and document lengths. Empirical results on both synthetic and semi-synthetic data show that our proposed estimator is a strong competitor of existing state-of-the-art algorithms for both non-sparse and sparse A, and has superior performance in many scenarios of interest. © 2020 Xin Bing, Florentina Bunea and Marten Wegkamp.
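A minimal simulation sketch, in Python, of the data-generating mechanism described in the abstract: a sparse word-topic matrix A and a topic-document matrix W whose product gives the expected word frequencies. All dimensions (p, K, n, the document length N) and the sparsity level are hypothetical choices for illustration; this is not the authors' estimation algorithm, only the factorization it targets.

import numpy as np

# Hypothetical dimensions, for illustration only.
p, K, n, N = 200, 5, 50, 300   # vocabulary size, topics, documents, words per document
rng = np.random.default_rng(0)

# Word-topic matrix A (p x K): each column is a probability vector over the
# p words, made element-wise sparse by zeroing most entries before normalizing.
A = rng.exponential(size=(p, K)) * (rng.random((p, K)) < 0.2)
A[:K, :] += np.eye(K)                  # keep every column away from all-zero
A /= A.sum(axis=0, keepdims=True)

# Topic-document matrix W (K x n): each column is a probability vector over topics.
W = rng.dirichlet(np.ones(K), size=n).T

# Observed word frequencies: multinomial counts per document divided by the
# document length, so the p x n data matrix X has mean A @ W.
Pi = A @ W
X = np.column_stack([rng.multinomial(N, Pi[:, j]) / N for j in range(n)])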
Related papers (50 in total)
  • [1] Optimal Estimation of Sparse Topic Models
    Bing, Xin
    Bunea, Florentina
    Wegkamp, Marten
    JOURNAL OF MACHINE LEARNING RESEARCH, 2020, 21
  • [2] LIKELIHOOD ESTIMATION OF SPARSE TOPIC DISTRIBUTIONS IN TOPIC MODELS AND ITS APPLICATIONS TO WASSERSTEIN DOCUMENT DISTANCE CALCULATIONS
    Bing, Xin
    Bunea, Florentina
    Strimas-Mackey, Seth
    Wegkamp, Marten
    ANNALS OF STATISTICS, 2022, 50 (06): : 3307 - 3333
  • [3] SPARSE TOPIC MODELS BY PARAMETER SHARING
    Soleimani, Hossein
    Miller, David J.
    2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
  • [4] Additive Regularization of Topic Models for Topic Selection and Sparse Factorization
    Vorontsov, Konstantin
    Potapenko, Anna
    Plavin, Alexander
    STATISTICAL LEARNING AND DATA SCIENCES, 2015, 9047 : 193 - 202
  • [5] Optimal estimation of rejection thresholds for topic spotting
    Subramanian, Krishna
    Prasad, Rohit
    Natarajan, Prem
    Schwartz, Richard
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 81 - +
  • [6] Optimal designs in sparse linear models
    Yimin Huang
    Xiangshun Kong
    Mingyao Ai
    Metrika, 2020, 83 : 255 - 273
  • [7] Optimal designs in sparse linear models
    Huang, Yimin
    Kong, Xiangshun
    Ai, Mingyao
    METRIKA, 2020, 83 (02) : 255 - 273
  • [8] Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
    Magnusson, Mans
    Jonsson, Leif
    Villani, Mattias
    Broman, David
    JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2018, 27 (02) : 449 - 463
  • [9] Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
    Terenin, Alexander
    Magnusson, Mans
    Jonsson, Leif
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2925 - 2934
  • [10] Table Topic Models for Hidden Unit Estimation
    Yoshida, Minoru
    Matsumoto, Kazuyuki
    Kita, Kenji
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2016, 2016, 9994 : 302 - 307