Mass of short texts clustering and topic extraction based on frequent itemsets

Cited by: 0
Authors
Peng, Min [1, 2]
Huang, Jiajia [1 ]
Zhu, Jiahui [3 ]
Huang, Jimin [1 ]
Liu, Jiping [1 ]
Affiliations
[1] Computer School, Wuhan University, Wuhan
[2] Shenzhen Research, Wuhan University, Shenzhen, 518057, Guangdong
[3] State Key Laboratory of Software Engineering (Wuhan University), Wuhan
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2015 / Vol. 52 / No. 09
Keywords
Clustering; Frequent itemsets; Large-scale; Short texts; Topic extraction
DOI
10.7544/issn1000-1239.2015.20140533
Abstract
Short texts generated in social media are characterized by high volume, high velocity, low quality, and variety, so vector-space-based clustering methods face the challenges of high dimensionality, feature sparsity, and noise. In this paper, we propose a short text clustering and topic extraction (STC-TE) framework based on frequent itemsets mined from the texts. The framework first studies the impact of multiple features on the quality of short texts. Then, a large number of frequent itemsets are mined from the high-quality short text set by setting a low support level, and a similar-itemset filtering strategy is devised to discard most of the unimportant frequent itemsets. Furthermore, based on frequent itemset similarity evaluated through the relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC) algorithm that groups the itemsets into different topic clusters. Finally, the large-scale collection of short texts is classified into the associated clusters according to the topic words extracted from the frequent itemset clusters. The framework is tested on a SinaWeibo dataset of one million short texts to evaluate the performance of important frequent itemset selection and clustering, topic word extraction, and large-scale short text classification. Experimental results show that the STC-TE framework achieves topic extraction and large-scale short text clustering with high accuracy. © 2015, Science Press. All rights reserved.
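As a rough illustration of the first two ideas in the abstract (mining frequent itemsets at a low support level, then comparing itemsets through the texts that contain them), the following Python sketch may help. It is not the authors' implementation: the level-wise (Apriori-style) miner, the Jaccard overlap used as the similarity, the function names, and the toy texts are all assumptions made for this example, and the abstract does not specify the actual similarity measure or the CSA_SC algorithm.

```python
from collections import defaultdict
from itertools import combinations


def mine_frequent_itemsets(texts, min_support, max_size=2):
    """Map each frequent itemset (frozenset of words) to the ids of the texts containing it.

    Hypothetical helper: a simple level-wise miner, not the method from the paper.
    """
    cover = defaultdict(set)
    for idx, words in enumerate(texts):
        for word in set(words):
            cover[frozenset([word])].add(idx)

    # Level 1: single words that appear in at least min_support texts.
    level = {s: ids for s, ids in cover.items() if len(ids) >= min_support}
    frequent = dict(level)

    # Grow itemsets one word at a time by joining itemsets of the previous level.
    for size in range(2, max_size + 1):
        next_level = {}
        for (a, ids_a), (b, ids_b) in combinations(level.items(), 2):
            union = a | b
            if len(union) != size or union in next_level:
                continue
            ids = ids_a & ids_b  # texts that contain every word of the union
            if len(ids) >= min_support:
                next_level[union] = ids
        frequent.update(next_level)
        level = next_level
    return frequent


def itemset_similarity(ids_a, ids_b):
    """Jaccard overlap of the supporting text sets (an assumed similarity measure)."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


# Toy tokenized short texts (made up for illustration).
texts = [
    ["rain", "flood", "city"],
    ["rain", "flood"],
    ["match", "goal"],
    ["match", "goal", "fans"],
    ["rain", "city"],
]
frequent = mine_frequent_itemsets(texts, min_support=2, max_size=2)
rain_flood = frequent[frozenset({"rain", "flood"})]
rain_city = frequent[frozenset({"rain", "city"})]
match_goal = frequent[frozenset({"match", "goal"})]
```

In this toy run, itemsets about the same topic share supporting texts and so receive a nonzero similarity, while itemsets from unrelated topics score zero. A pairwise similarity matrix built this way is the kind of input a spectral clustering step could consume.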
Pages: 1941-1953
Page count: 12