Hot topic identification from micro-blog based on improved Single-pass algorithm

Cited by: 2
Authors
Feng J. [1]
Ding Y. [2]
Luo X. [1]
Affiliations
[1] College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, Shaanxi
[2] Beijing Beibian MicroGrid Technology Co., Ltd., Beijing
Keywords
Clustering; Hot topic identification; Single-pass; Word segmentation
DOI
10.3233/JCM-170760
Abstract
Identifying hot topics in micro-blogs is important for the detection and control of public opinion. When the Single-pass algorithm is used to cluster hot topics in Chinese micro-blogs, Chinese word segmentation is a necessary preprocessing step, but it inevitably introduces segmentation errors. These errors lower the clustering precision of topic identification. To address this problem, this paper proposes an improved Single-pass algorithm that combines cosine similarity (CS) and longest common subsequence (LCS) to calculate the similarity between Chinese words. Experiments on hot topic identification were carried out on three different micro-blog data sets, and the results show that the improved algorithm achieves both higher recall and higher precision than the original algorithm. The proposed algorithm is feasible and effective. © 2017 - IOS Press and the authors. All rights reserved.
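The abstract does not give the exact formula used to combine CS and LCS, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a simple linear blend controlled by a hypothetical weight `alpha`, computes the LCS ratio over the characters of the unsegmented text, and uses the first post of each cluster as that cluster's representative. The function names, the threshold value, and the sample posts are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(tokens_a, tokens_b):
    """LCS length over the joined characters, normalised by the longer text.

    Working on characters makes the score independent of where the
    segmenter placed word boundaries (an assumption of this sketch).
    """
    text_a, text_b = "".join(tokens_a), "".join(tokens_b)
    if not text_a or not text_b:
        return 0.0
    return lcs_length(text_a, text_b) / max(len(text_a), len(text_b))


def combined_similarity(tokens_a, tokens_b, alpha=0.5):
    """Hypothetical linear blend: alpha * CS + (1 - alpha) * LCS ratio."""
    return (alpha * cosine_similarity(tokens_a, tokens_b)
            + (1 - alpha) * lcs_similarity(tokens_a, tokens_b))


def single_pass(docs, threshold=0.5, alpha=0.5):
    """Single-pass clustering: each post is seen once, in arrival order.

    A post joins the most similar existing cluster if the similarity to
    that cluster's seed (its first post) reaches the threshold; otherwise
    it starts a new cluster. Returns one cluster label per post.
    """
    seeds, labels = [], []
    for tokens in docs:
        best_idx, best_sim = -1, 0.0
        for idx, seed in enumerate(seeds):
            sim = combined_similarity(tokens, seed, alpha)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            labels.append(best_idx)
        else:
            seeds.append(tokens)
            labels.append(len(seeds) - 1)
    return labels


# Illustrative posts; the second one mimics a segmentation error
# that splits "地铁" into "地" and "铁".
posts = [
    ["北京", "地铁", "票价", "上涨"],
    ["北京", "地", "铁", "票价", "调整"],
    ["世界杯", "决赛", "今晚", "开赛"],
]
print(single_pass(posts, threshold=0.5, alpha=0.5))  # blended similarity
print(single_pass(posts, threshold=0.5, alpha=1.0))  # cosine similarity only
```

In this toy example, cosine similarity alone places the first two posts in different clusters because the mis-segmented word no longer matches, while the blended score still groups them. This illustrates the kind of segmentation error the abstract describes; it says nothing about the recall and precision figures reported in the paper.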
Pages: 791-798
Page count: 7