Optimal estimation of sparse topic models

Cited: 0
Authors
Bing, Xin [1 ]
Bunea, Florentina [1 ]
Wegkamp, Marten [2 ]
Affiliations
[1] Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, United States
[2] Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, United States
Funding
National Science Foundation (USA)
Keywords
Matrix algebra; Sampling
DOI
Not available
Abstract
Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consist of observed frequencies of a vocabulary of p words in n documents, stored in a p × n matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a p × K word-topic matrix A and a K × n topic-document matrix W. This paper studies the estimation of a possibly element-wise sparse A when the number of topics K is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such A and propose a new computationally efficient algorithm for its recovery. We derive a finite-sample upper bound for our estimator and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of A, and our analysis is valid for any finite n, p, K and document lengths. Empirical results on both synthetic and semi-synthetic data show that our proposed estimator is a strong competitor of existing state-of-the-art algorithms for both non-sparse and sparse A, and has superior performance in many scenarios of interest. © 2020 Xin Bing, Florentina Bunea and Marten Wegkamp.
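A minimal simulation sketch, in Python, of the data-generating mechanism described in the abstract: a sparse word-topic matrix A and a topic-document matrix W whose product gives the expected word frequencies. All dimensions (p, K, n, the document length N) and the sparsity level are hypothetical choices for illustration; this is not the authors' estimation algorithm, only the factorization it targets.

import numpy as np

# Hypothetical dimensions, for illustration only.
p, K, n, N = 200, 5, 50, 300   # vocabulary size, topics, documents, words per document
rng = np.random.default_rng(0)

# Word-topic matrix A (p x K): each column is a probability vector over the
# p words, made element-wise sparse by zeroing most entries before normalizing.
A = rng.exponential(size=(p, K)) * (rng.random((p, K)) < 0.2)
A[:K, :] += np.eye(K)                  # keep every column away from all-zero
A /= A.sum(axis=0, keepdims=True)

# Topic-document matrix W (K x n): each column is a probability vector over topics.
W = rng.dirichlet(np.ones(K), size=n).T

# Observed word frequencies: multinomial counts per document divided by the
# document length, so the p x n data matrix X has mean A @ W.
Pi = A @ W
X = np.column_stack([rng.multinomial(N, Pi[:, j]) / N for j in range(n)])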
Related papers (50 in total)
  • [1] Optimal Estimation of Sparse Topic Models
    Bing, Xin
    Bunea, Florentina
    Wegkamp, Marten
    JOURNAL OF MACHINE LEARNING RESEARCH, 2020, 21
  • [2] LIKELIHOOD ESTIMATION OF SPARSE TOPIC DISTRIBUTIONS IN TOPIC MODELS AND ITS APPLICATIONS TO WASSERSTEIN DOCUMENT DISTANCE CALCULATIONS
    Bing, Xin
    Bunea, Florentina
    Strimas-Mackey, Seth
    Wegkamp, Marten
    ANNALS OF STATISTICS, 2022, 50 (06): : 3307 - 3333
  • [3] SPARSE TOPIC MODELS BY PARAMETER SHARING
    Soleimani, Hossein
    Miller, David J.
    2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
  • [4] Additive Regularization of Topic Models for Topic Selection and Sparse Factorization
    Vorontsov, Konstantin
    Potapenko, Anna
    Plavin, Alexander
    STATISTICAL LEARNING AND DATA SCIENCES, 2015, 9047 : 193 - 202
  • [5] Optimal estimation of rejection thresholds for topic spotting
    Subramanian, Krishna
    Prasad, Rohit
    Natarajan, Prem
    Schwartz, Richard
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 81 - +
  • [6] Optimal designs in sparse linear models
    Yimin Huang
    Xiangshun Kong
    Mingyao Ai
    Metrika, 2020, 83 : 255 - 273
  • [7] Optimal designs in sparse linear models
    Huang, Yimin
    Kong, Xiangshun
    Ai, Mingyao
    METRIKA, 2020, 83 (02) : 255 - 273
  • [8] Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
    Magnusson, Mans
    Jonsson, Leif
    Villani, Mattias
    Broman, David
    JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2018, 27 (02) : 449 - 463
  • [9] Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
    Terenin, Alexander
    Magnusson, Mans
    Jonsson, Leif
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2925 - 2934
  • [10] Table Topic Models for Hidden Unit Estimation
    Yoshida, Minoru
    Matsumoto, Kazuyuki
    Kita, Kenji
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2016, 2016, 9994 : 302 - 307