DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

Cited by: 14
Authors
Lakshmi, R. [1]
Baskar, S. [2]
Affiliations
[1] KLN Coll Engn, Dept Comp Sci & Engn, Pottapalayam 630612, Tamil Nadu, India
[2] Thiagarajar Coll Engn, Dept Elect & Elect Engn, Thiruparankundram, Tamil Nadu, India
Keywords
Document clustering; entropy; F-measure; initial cluster centroids; K-means clustering; purity; ALGORITHM; CLASSIFICATION;
DOI
10.1177/0165551518816302
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means), to improve the performance of text document clustering is proposed. The first centroid is the document having the minimum standard deviation of its term frequency. Each of the other subsequent centroids is selected based on the dissimilarities of the previously selected centroids. For comparing the performance of the proposed DIC-DOC-K-means algorithm, the results of the K-means, K-means++ and weighted average of terms-based initial centroid selection + K-means (Weight_Avg_Initials + K-means) clustering algorithms are considered. The results show that the proposed DIC-DOC-K-means algorithm performs significantly better than the K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms for Reuters-21578 and WebKB with respect to purity, entropy and F-measure for most of the cluster sizes. The cluster sizes used for Reuters-8 are 8, 16, 24 and 32 and those for WebKB are 4, 8, 12 and 16. The results of the proposed DIC-DOC-K-means give a better performance for the number of clusters that are equal to the number of classes in the data set.
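The seeding idea in the abstract can be illustrated with a minimal sketch. The abstract fixes only the first step (the document with the minimum standard deviation of term frequency); it does not state the dissimilarity measure or the exact rule for later centroids. The sketch below therefore assumes cosine dissimilarity and a farthest-from-nearest-chosen rule, and the function name `dissimilarity_init` is hypothetical. It should not be read as the authors' exact procedure.

```python
import numpy as np

def dissimilarity_init(tf, k):
    """Return indices of k seed documents from a documents-by-terms matrix.

    Step 1 follows the abstract: the first seed is the document whose
    term-frequency vector has the smallest standard deviation.
    Later steps are an ASSUMPTION: cosine dissimilarity, and each new
    seed is the document farthest from its nearest already-chosen seed.
    """
    tf = np.asarray(tf, dtype=float)

    # First seed: minimum standard deviation of term frequencies.
    chosen = [int(np.argmin(tf.std(axis=1)))]

    # Normalise rows so a dot product gives cosine similarity
    # (all-zero rows are left as zeros to avoid division by zero).
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    unit = tf / np.where(norms == 0, 1.0, norms)

    # min_dis[i] = dissimilarity from document i to its closest seed so far.
    min_dis = 1.0 - unit @ unit[chosen[0]]

    while len(chosen) < k:
        nxt = int(np.argmax(min_dis))  # most dissimilar to all seeds
        chosen.append(nxt)
        min_dis = np.minimum(min_dis, 1.0 - unit @ unit[nxt])

    return chosen
```

The returned row indices could then be used to seed any K-means implementation that accepts explicit starting centroids, for example by passing the selected rows as the `init` array of scikit-learn's `KMeans` with `n_init=1`.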
Pages: 818-832
Page count: 15