Topic Detection from Microblogs Using T-LDA and Perplexity

Cited by: 35
Authors
Huang, Ling [1]
Ma, Jinyu [1]
Chen, Chunling [2]
Affiliations
[1] Nanjing Univ, Software Inst, Nanjing, Jiangsu, Peoples R China
[2] Nanjing Univ Posts & Telecommun, Coll Comp, Nanjing, Jiangsu, Peoples R China
Source
2017 24TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE WORKSHOPS (APSECW) | 2017
Funding
National Natural Science Foundation of China
Keywords
Latent Dirichlet Allocation; Term Frequency Inverse Document Frequency; T-LDA; Perplexity; Microblog
DOI
10.1109/APSECW.2017.11
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Because microblogs are short and numerous, traditional Latent Dirichlet Allocation (LDA) cannot be applied effectively to mining topics from microblog content. In this paper, we introduce Term Frequency-Inverse Document Frequency (TF-IDF), which adjusts word weights and can be computed quickly without considering word positions in documents, to help extract keywords from relatively short texts. Combining LDA with TF-IDF, we propose a new topic detection method named T-LDA. In addition, we use the Perplexity-K curve to identify the most meaningful number of topics (i.e., the K-value), in order to reduce human bias in choosing K. We collected 3407 Chinese microblogs, chose the optimal K-value according to the Perplexity-K curve, and conducted a series of comparative trials among T-LDA, LDA, and K-Means. We found that T-LDA outperforms LDA and K-Means in terms of topic results, modeling time, precision, recall, and F-measure, which indicates that the improvement on LDA is effective.
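The two building blocks named in the abstract, TF-IDF weighting and perplexity-based selection of K, can be sketched as follows. This is a minimal illustration using the textbook definitions, not the authors' T-LDA implementation: the abstract does not say how the TF-IDF weights are fed into LDA, so that step is left out. The function names, the toy documents, and the rule of taking the K with the lowest perplexity are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def perplexity(token_log_probs):
    """Exponential of the negative mean log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pick_k(perplexity_by_k):
    """Assumed selection rule: the K whose model has the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# Hypothetical toy corpus of three already-tokenized microblogs.
docs = [
    ["rain", "flood", "city"],
    ["rain", "match", "goal"],
    ["match", "goal", "team"],
]
weights = tf_idf(docs)
```

In practice the per-token log-probabilities passed to `perplexity` would come from an LDA model trained for each candidate K and evaluated on held-out text, and the resulting values would be plotted against K before `pick_k` (or a visual inspection of the curve) settles the choice.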
Pages: 71-77
Number of pages: 7