Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

Cited by: 16

Authors:

Jo, Taeho [1]

Affiliations:

[1] Inha Univ, Sch Comp & Informat Engn, Incheon, South Korea

Source:

JOURNAL OF INFORMATION PROCESSING SYSTEMS | 2008 / Vol. 4 / No. 2

Keywords:

String Vector; K Means Algorithm; Text Clustering;

DOI:

10.3745/JIPS.2008.4.2.067

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k-means algorithm is modified to operate on string vectors. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area. For example, in text clustering, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and modify the k-means algorithm so that it is adaptable to string vectors for text clustering.
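The contrast drawn in the abstract can be illustrated with a small sketch. This record does not define the paper's string vector or its modified k-means, so everything below is an assumption made for illustration: the toy corpus, the helper names (`to_numerical_vector`, `to_string_vector`, `sparsity`), and in particular the reading of a "string vector" as a fixed-length list of a document's most frequent words. It is not the author's method.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "text clustering groups similar documents",
    "k means clustering assigns documents to clusters",
    "string vectors encode documents as ordered words",
]


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


# Numerical encoding: one dimension per distinct word in the whole corpus.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})


def to_numerical_vector(text):
    """Term-frequency vector over the corpus vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def sparsity(vector):
    """Fraction of zero-valued dimensions in a numerical vector."""
    return sum(1 for value in vector if value == 0) / len(vector)


def to_string_vector(text, d=3):
    """ASSUMED form of a string vector: the d most frequent words,
    with ties broken alphabetically. Not taken from the paper."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return ranked[:d]


for doc in docs:
    numerical = to_numerical_vector(doc)
    print(len(numerical), round(sparsity(numerical), 2), to_string_vector(doc))
```

Under these assumptions, the numerical vector grows with the corpus vocabulary and is mostly zeros, while the string vector keeps a fixed, small length regardless of vocabulary size. Clustering such vectors would need a similarity measure defined on words rather than on numbers, which is the part the modified algorithm would have to supply.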

Citation

Pages: 67-76

Page count: 10

