DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

Cited by: 14

Authors:

Lakshmi, R. [1]

Baskar, S. [2]

Affiliations:

[1] KLN Coll Engn, Dept Comp Sci & Engn, Pottapalayam 630612, Tamil Nadu, India

[2] Thiagarajar Coll Engn, Dept Elect & Elect Engn, Thiruparankundram, Tamil Nadu, India

Source:

JOURNAL OF INFORMATION SCIENCE | 2019 / Vol. 45 / No. 06

Keywords:

Document clustering; entropy; F-measure; initial cluster centroids; K-means clustering; purity; ALGORITHM; CLASSIFICATION;

DOI:

10.1177/0165551518816302

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means), to improve the performance of text document clustering is proposed. The first centroid is the document having the minimum standard deviation of its term frequency. Each of the other subsequent centroids is selected based on the dissimilarities of the previously selected centroids. For comparing the performance of the proposed DIC-DOC-K-means algorithm, the results of the K-means, K-means++ and weighted average of terms-based initial centroid selection + K-means (Weight_Avg_Initials + K-means) clustering algorithms are considered. The results show that the proposed DIC-DOC-K-means algorithm performs significantly better than the K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms for Reuters-21578 and WebKB with respect to purity, entropy and F-measure for most of the cluster sizes. The cluster sizes used for Reuters-8 are 8, 16, 24 and 32 and those for WebKB are 4, 8, 12 and 16. The results of the proposed DIC-DOC-K-means give a better performance for the number of clusters that are equal to the number of classes in the data set.
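
The seeding idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that the first centroid is the document with the minimum standard deviation of its term frequency and that later centroids depend on dissimilarity to the centroids already chosen. Everything else here is an assumption made for illustration and is not taken from the article: the function name `select_initial_centroids`, the use of cosine dissimilarity, and the farthest-first rule (pick the document whose closest already-chosen seed is most dissimilar).

```python
import numpy as np

def select_initial_centroids(tf, k):
    """Return the row indices of k seed documents.

    tf is an (n_docs, n_terms) term-frequency matrix.
    """
    tf = np.asarray(tf, dtype=float)

    # First seed (from the abstract): the document whose term
    # frequencies have the smallest standard deviation.
    first = int(np.argmin(tf.std(axis=1)))
    chosen = [first]

    # ASSUMPTION: dissimilarity = 1 - cosine similarity.
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    unit = tf / np.where(norms == 0, 1.0, norms)

    # min_dissim[i]: dissimilarity of document i to its closest chosen seed.
    min_dissim = 1.0 - unit @ unit[first]

    while len(chosen) < k:
        min_dissim[chosen] = -np.inf  # never pick the same document twice
        # ASSUMPTION: farthest-first rule for every subsequent seed.
        nxt = int(np.argmax(min_dissim))
        chosen.append(nxt)
        min_dissim = np.minimum(min_dissim, 1.0 - unit @ unit[nxt])

    return chosen

# Toy term-frequency matrix: 4 documents, 4 terms.
toy_tf = [
    [2, 2, 2, 2],
    [9, 0, 0, 0],
    [0, 0, 8, 1],
    [8, 1, 0, 0],
]
seeds = select_initial_centroids(toy_tf, 3)
print(seeds)
```

The selected rows could then be used as the starting centroids of any K-means implementation, for example by passing `tf[seeds]` as the `init` array of scikit-learn's `KMeans` with `n_init=1`.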

Pages: 818-832

Page count: 15
