A new anchor word selection method for the separable topic discovery

被引:2
作者
He, Kun [1 ]
Wang, Wu [1 ]
Wang, Xiaosen [1 ]
Hopcroft, John E. [2 ]
机构
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Hubei, Peoples R China
[2] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
基金
中国国家自然科学基金;
关键词
anchor word; graph; separability; topic modeling; word similarity; NONNEGATIVE MATRIX FACTORIZATION;
D O I
10.1002/widm.1313
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Separable nonnegative matrix factorization (SNMF) is an important method for topic modeling, where "separable" assumes every topic contains at least one anchor word, defined as a word that has non-zero probability only on that topic. SNMF focuses on the word co-occurrence patterns to reveal topics by two steps: anchor word selection and topic recovery. The quality of the anchor words strongly influences the quality of the extracted topics. Existing anchor word selection algorithm is to greedily find an approximate convex hull in a high-dimensional word co-occurrence space. In this work, we propose a new method for the anchor word selection by associating the word co-occurrence probability with the words similarity and assuming that the most different words on semantic are potential candidates for the anchor words. Therefore, if the similarity of a word-pair is very low, the two words are very likely to be the anchor words. According to the statistical information of text corpora, we can get the similarity of all word-pairs. We build the word similarity graph where the nodes correspond to words and weights on edges stand for the word-pair similarity. Following this way, we design a greedy method to find a minimum edge-weight anchor clique of a given size in the graph for the anchor word selection. Extensive experiments on real-world corpus demonstrate the effectiveness of the proposed anchor word selection method that outperforms the common convex hull-based methods on the revealed topic quality. Meanwhile, our method is much faster than typical SNMF-based method. This article is categorized under: Algorithmic Development > Text Mining Technologies > Machine Learning
引用
收藏
页数:10
相关论文
共 28 条
  • [1] [Anonymous], 2010, UCI machine learning repository
  • [2] [Anonymous], 2001, P 20 ACM SIGMOD SIGA
  • [3] [Anonymous], 2013, INT C MACHINE LEARNI
  • [4] Learning Topic Models - Going beyond SVD
    Arora, Sanjeev
    Ge, Rong
    Moitra, Ankur
    [J]. 2012 IEEE 53RD ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS), 2012, : 1 - 10
  • [5] Latent Dirichlet allocation
    Blei, DM
    Ng, AY
    Jordan, MI
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 993 - 1022
  • [6] Chen J, 2013, 2013 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO), P2446, DOI 10.1109/ROBIO.2013.6739838
  • [7] Donoho D, 2004, ADV NEUR IN, V16, P1141
  • [8] Foulds J, 2013, 19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), P446
  • [9] Gillis N, 2012, J MACH LEARN RES, V13, P3349
  • [10] Finding scientific topics
    Griffiths, TL
    Steyvers, M
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2004, 101 : 5228 - 5235