Text Document Clustering Based on Density K-means

Cited by: 0

Authors:

Wu, Di [1]

Zeng, Yan [2]

Qu, Yin-chuan [2]

Affiliations:

[1] Natl Univ Def, Coll Comp, Changsha, Hunan, Peoples R China

[2] Beijing Gaodi Informat Technol Co Ltd, Beijing, Peoples R China

Source:

INTERNATIONAL CONFERENCE ON COMPUTER, MECHATRONICS AND ELECTRONIC ENGINEERING (CMEE 2016) | 2016

Funding:

National Natural Science Foundation of China;

Keywords:

K-means; Density; Text document; Clustering; NUMBER;

DOI:

Not available

Chinese Library Classification number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

K-means is one of the most fundamental clustering techniques. It has been applied in many fields, such as image processing and natural language processing, and it performs well in many cases, especially on large data sets. However, choosing the initial cluster centers is a hard problem: different choices can make the K-means result unstable or trap it in a local optimum. Many methods have been proposed to solve this problem, but they apply only to certain fields and perform poorly on text document clustering. In this paper, we design a novel density K-means algorithm and apply it to text document clustering. The experimental results show that it performs better than most existing methods on a Chinese corpus. Furthermore, compared with other algorithms, our algorithm effectively reduces the number of iterations.
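The abstract does not describe the algorithm itself, so the following is only a generic sketch of density-based seeding for K-means, not the authors' method. The names `density_seeds`, `kmeans`, and the `radius` parameter are hypothetical, and the seeding rule (densest point first, then points that are both dense and far from the seeds already chosen) is an assumption made for illustration. For text documents the points would typically be TF-IDF vectors compared with cosine distance; Euclidean distance is used here only to keep the example short.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_seeds(points, k, radius):
    """Pick k initial centers, favouring dense points that lie far apart.

    Assumed rule (not taken from the paper): the density of a point is the
    number of other points within `radius`; each further seed maximises
    (density + 1) * distance to the nearest seed already chosen.
    """
    n = len(points)
    density = [
        sum(1 for j in range(n) if j != i and euclidean(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    # First seed: the densest point (ties go to the lowest index).
    chosen = [max(range(n), key=lambda i: density[i])]
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen]

        def score(i):
            nearest = min(euclidean(points[i], points[c]) for c in chosen)
            return (density[i] + 1) * nearest

        chosen.append(max(candidates, key=score))
    return [list(points[i]) for i in chosen]


def kmeans(points, initial_centers, max_iter=100):
    """Plain Lloyd iterations; returns (centers, labels, update_rounds)."""
    centers = [list(c) for c in initial_centers]  # do not mutate the input
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            # Assignments are stable, so the previous round was the last update.
            return centers, labels, iteration - 1
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels, max_iter
```

Because the seeds are chosen deterministically, repeated runs on the same data give the same clustering, which is the kind of stability the abstract contrasts with arbitrary initial centers. Seeds that already sit inside separate dense regions also tend to need fewer update rounds before the assignments stop changing.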

References

Pages: 8

18 references in total

[1] Bernotas M., 2015, INFORM TECHNOLOGY CO, V36

[2] Dimensionality Reduction for k-Means Clustering and Low Rank Approximation [J]. Cohen, Michael B.; Elder, Sam; Musco, Cameron; Musco, Christopher; Persu, Madalina. STOC'15: PROCEEDINGS OF THE 2015 ACM SYMPOSIUM ON THEORY OF COMPUTING, 2015: 163-172

[3] FORGY EW, 1965, BIOMETRICS, V21, P768

[4] A non-parametric method to estimate the number of clusters [J]. Fujita, Andre; Takahashi, Daniel Y.; Patriota, Alexandre G. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2014, 73: 27-39

[5] A text similarity measurement combining word semantic information with TF-IDF method [J]. Huang C.-H.; Yin J.; Hou F. Jisuanji Xuebao/Chinese Journal of Computers, 2011, 34(05): 856-864

[6] Jiawei H., 2001, Data mining: concepts and techniques, V5

[7] Joshi AmeyaC., 2015, Enforcing document clustering for forensic analysis using weighted matrix method (wmm)

[8] Cluster center initialization algorithm for K-means clustering [J]. Khan, SS; Ahmad, A. PATTERN RECOGNITION LETTERS, 2004, 25(11): 1293-1302

[9] MacQueen, 1967, BERK S MATH STAT PRO, DOI 10.1007/S11665-016-2173-6

[10] AN EXAMINATION OF PROCEDURES FOR DETERMINING THE NUMBER OF CLUSTERS IN A DATA SET [J]. MILLIGAN, GW; COOPER, MC. PSYCHOMETRIKA, 1985, 50(02): 159-179
