The Dual-Sparse Topic Model: Mining Focused Topics and Focused Terms in Short Text

Cited by: 92
Authors:
Lin, Tianyi [1]
Tian, Wentao [1]
Mei, Qiaozhu [2]
Cheng, Hong [1]
Affiliations:
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
Source:
WWW'14: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB | 2014
Funding:
U.S. National Science Foundation
Keywords:
Topic modeling; spike and slab; sparse representation; user-generated content
DOI:
10.1145/2566486.2567980
CLC Number:
TP [Automation and computer technology]
Subject Classification Code:
0812
Abstract:
Topic modeling has proven to be an effective method for exploratory text mining. Most topic models share the assumption that a document is generated from a mixture of topics. In real-world scenarios, however, an individual document usually concentrates on a few salient topics rather than covering a wide variety of topics. Likewise, a real topic adopts a narrow range of terms rather than a wide coverage of the vocabulary. Understanding this sparsity of information is especially important for analyzing user-generated Web content and social media, which feature extremely short posts and condensed discussions. In this paper, we propose a dual-sparse topic model that addresses sparsity in both the topic mixtures and the word usage. By applying a "spike and slab" prior to decouple the sparsity and smoothness of the document-topic and topic-word distributions, we allow each document to select a few focused topics and each topic to select a few focused terms. Experiments on large corpora of different genres demonstrate that the dual-sparse topic model outperforms both classical topic models and existing sparsity-enhanced topic models. The improvement is especially notable on collections of short documents.
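The spike-and-slab idea described in the abstract can be illustrated with a minimal generative sketch: a Bernoulli "spike" selects which topics a document may use at all, and a Dirichlet "slab" smooths the mixture over only the selected topics. This is a hedged toy illustration, not the authors' inference procedure; the parameter names (`pi` for the selection probability, `alpha` for the slab concentration) and the values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10        # number of topics (illustrative)
pi = 0.3      # spike: probability that a topic is selected for a document
alpha = 1.0   # slab: Dirichlet concentration over the selected topics

# Spike: Bernoulli selectors decide which topics the document may use.
selected = rng.random(K) < pi
if not selected.any():                 # guarantee at least one selected topic
    selected[rng.integers(K)] = True

# Slab: a smooth Dirichlet mixture over only the selected topics;
# unselected topics get exactly zero probability.
theta = np.zeros(K)
theta[selected] = rng.dirichlet(alpha * np.ones(selected.sum()))

print(theta)   # a sparse document-topic mixture
```

The resulting `theta` is sparse by construction (exact zeros on unselected topics) yet smooth on its support, which is the decoupling of sparsity and smoothness that the abstract attributes to the spike-and-slab prior. The same construction applies symmetrically to topic-word distributions over the vocabulary.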
Pages: 539-549
Page count: 11