Scalable Topical Phrase Mining from Text Corpora

Cited by: 116
Authors
El-Kishky, Ahmed [1 ]
Song, Yanglei [1 ]
Wang, Chi [2 ]
Voss, Clare R. [3 ]
Han, Jiawei [1 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Microsoft Res, Redmond, WA USA
[3] Computat & Informat Sci Directorate Army Res Lab, Adelphi, MD USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / Issue 3
Funding
U.S. National Science Foundation;
Keywords
DOI
10.14778/2735508.2735519
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on the inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post-processing on the results of unigram-based topic models or uses complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework, which segments each document into single- and multi-word phrases, with a new topic model that operates on the induced document partition. Our approach discovers high-quality topical phrases at negligible extra cost over a bag-of-words topic model on a variety of datasets, including research publication titles, abstracts, reviews, and news articles.
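The abstract describes a two-stage pipeline: segment each document into single- and multi-word phrases, then run a topic model over the resulting partition. The sketch below is not the authors' algorithm; it is a toy illustration of the first stage only, assuming a simple z-score-style significance test for greedily merging frequent adjacent tokens into phrases. The function name mine_phrases and the parameters min_count and alpha are hypothetical.

    from collections import Counter
    from math import sqrt

    def mine_phrases(docs, min_count=2, alpha=2.0):
        """Toy agglomerative phrase segmentation (illustrative only).

        Repeatedly merges the adjacent token pair whose observed count most
        exceeds the count expected under independence, until no pair passes
        the significance threshold alpha. Returns the segmented documents,
        with merged phrases as single space-joined tokens."""
        docs = [list(d) for d in docs]
        while True:
            total = sum(len(d) for d in docs)
            unigram = Counter(tok for d in docs for tok in d)
            bigram = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))
            best, best_sig = None, alpha
            for (a, b), n_ab in bigram.items():
                if n_ab < min_count:
                    continue
                expected = unigram[a] * unigram[b] / total   # count expected if a and b were independent
                sig = (n_ab - expected) / sqrt(n_ab)         # z-score-like significance of the pair
                if sig > best_sig:
                    best, best_sig = (a, b), sig
            if best is None:
                return docs
            a, b = best
            for d in docs:                                   # merge every occurrence of the winning pair
                i = 0
                while i < len(d) - 1:
                    if d[i] == a and d[i + 1] == b:
                        d[i:i + 2] = [a + " " + b]
                    i += 1

Feeding the segmented documents into a standard bag-of-words topic model, with each merged phrase treated as a single token, approximates the second stage; the paper's own topic model additionally constrains the words of a phrase to share a topic.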
Pages: 305-316
Page count: 12