A Graph-based Bursty Topic Detection Approach in User-Generated Texts

Cited by: 0
Authors
Zhao, Li [1]
Li, Yan [1]
Liu, Xinran [1]
Zhang, Hong [1]
Affiliations
[1] Natl Comp Network Emergency Response Tech Team Co, Beijing, Peoples R China
Source
2014 11th Web Information System and Application Conference (WISA) | 2014
Keywords
Bursty Topic Detection; Graph Theory; User-Generated Texts
DOI
10.1109/WISA.2014.57
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Detecting hot bursty topics in user-generated texts deserves close attention as Internet technologies proliferate. However, traditional document clustering and probabilistic topic models, which were developed for formal news articles, are less effective on informal user-generated corpora. In this paper, we present a graph-based perspective that reflects the latent pattern of bursty topics in a text stream, and we develop an effective solution to the bursty topic detection problem. We represent texts and their topics as a directed, weighted graph whose vertices are bursty words and whose edge weights are the Tversky index between pairs of bursty words. Topic detection is then converted into dividing the constructed graph into separate subgraphs, with each significant subgraph corresponding to a bursty topic. To accomplish this, we partition the bursty word graph into its strongly connected components, based on the analysis that the important topical words within a graph are connected to each other with high weights and thus form strongly connected components. Experiments on two user-generated corpora, collected from English weblog and Chinese weibo (microblog) sites, demonstrate that the proposed approach effectively detects hot bursty topics and is more appropriate than other topic detection models such as the LDA topic model and the EGF approach in the TDT project.
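The pipeline outlined in the abstract (bursty words as vertices, Tversky index as directed edge weights, topics as strongly connected components) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the Tversky parameters, how word sets are defined, or how low-weight edges are handled, so the values `alpha=1.0`, `beta=0.0`, the use of document-id sets per word, the pruning `threshold`, and the toy data below are all assumptions made for illustration. Bursty-word detection itself is assumed to have happened already.

```python
def tversky(x_docs, y_docs, alpha=1.0, beta=0.0):
    """Tversky index of two document-id sets.

    With alpha != beta the index is asymmetric, which is what makes the
    word graph directed. alpha and beta here are illustrative assumptions.
    """
    common = len(x_docs & y_docs)
    denom = common + alpha * len(x_docs - y_docs) + beta * len(y_docs - x_docs)
    return common / denom if denom else 0.0


def build_graph(word_docs, threshold):
    """Keep a directed edge u -> v when its Tversky weight reaches threshold.

    Thresholding is an assumed simplification of the weighted graph.
    """
    graph = {w: set() for w in word_docs}
    for u in word_docs:
        for v in word_docs:
            if u != v and tversky(word_docs[u], word_docs[v]) >= threshold:
                graph[u].add(v)
    return graph


def strongly_connected_components(graph):
    """Kosaraju's two-pass algorithm, written iteratively."""
    # Pass 1: depth-first search recording vertices in order of completion.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {w: set() for w in graph}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].add(u)

    # Pass 2: search the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Hypothetical toy data: bursty word -> ids of the posts that contain it.
word_docs = {
    "earthquake": {1, 2, 3},
    "rescue": {1, 2, 3, 4},
    "magnitude": {2, 3},
    "concert": {7, 8},
    "tickets": {7, 8, 9},
    "weather": {5},
}

graph = build_graph(word_docs, threshold=0.6)
# Treat only components with at least two words as candidate topics.
topics = sorted(
    sorted(c) for c in strongly_connected_components(graph) if len(c) >= 2
)
print(topics)
```

On this toy input the isolated word "weather" forms a singleton component and is discarded, while the two groups of mutually co-occurring words survive as candidate topics. What counts as a "significant" subgraph in the paper may differ from the size filter assumed here.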
Pages: 273-278
Number of pages: 6