An Improved K-means Algorithm for Document Clustering

Cited by: 8
Authors
Wu, Guohua [1]
Lin, Hairong [1]
Fu, Ershuai [1]
Wang, Liuyang [1]
Affiliation
[1] Hang Zhou Dian Zi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
Source
2015 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND MECHANICAL AUTOMATION (CSMA) | 2015
Keywords
K-Means; SimHash; Text clustering;
DOI
10.1109/CSMA.2015.20
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
K-Means performs poorly on high-dimensional, sparse data, because traditional distance measures cannot handle such data effectively. Motivated by this, this paper proposes a K-Means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation but also allows the Hamming distance between fingerprints to be used directly as the vector distance. The Hamming distance then determines which cluster each text belongs to. Experimental results show that the algorithm preserves clustering quality while greatly reducing the running time of K-Means clustering.
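The pipeline described in the abstract (fingerprint each text with SimHash, then cluster by Hamming distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the fingerprint length, the feature weighting, the token hash, or how cluster centers are updated. The sketch assumes 64-bit fingerprints, term-frequency weights, MD5 as a stable token hash, a bitwise majority vote for the center update, and the first k fingerprints as initial centers. All function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint length; the abstract does not state it


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens, bits=BITS) -> int:
    """SimHash fingerprint with term-frequency weights (an assumption)."""
    acc = [0] * bits
    for tok, weight in Counter(tokens).items():
        h = token_hash(tok)
        for i in range(bits):
            # Add the weight where the token's bit is 1, subtract where it is 0.
            acc[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, value in enumerate(acc):
        if value > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def majority_fingerprint(fps, bits=BITS) -> int:
    """Assumed center update: a bit is set if most members have it set."""
    fp = 0
    for i in range(bits):
        ones = sum((f >> i) & 1 for f in fps)
        if 2 * ones > len(fps):
            fp |= 1 << i
    return fp


def assign(fps, centers):
    """Label each fingerprint with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: hamming(fp, centers[c]))
            for fp in fps]


def kmeans_hamming(fps, k, iters=10):
    """K-Means over fingerprints using Hamming distance."""
    centers = list(fps[:k])  # naive initialisation, for illustration only
    for _ in range(iters):
        labels = assign(fps, centers)
        new_centers = []
        for c in range(k):
            members = [fp for fp, lab in zip(fps, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(majority_fingerprint(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(fps, centers), centers
```

Because each text is reduced to a single integer, every distance computation is one XOR and a bit count, regardless of vocabulary size. That is the source of the speed-up the abstract reports, though the actual gain depends on details the abstract does not specify.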
Pages: 65-69
Page count: 5