Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

Cited by: 16
Author
Jo, Taeho [1]
Affiliation
[1] Inha Univ, Sch Comp & Informat Engn, Incheon, South Korea
Source
JOURNAL OF INFORMATION PROCESSING SYSTEMS | 2008 / Vol. 4 / No. 2
Keywords
String Vector; K Means Algorithm; Text Clustering
DOI
10.3745/JIPS.2008.4.2.067
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k-means algorithm is modified to operate on string vectors. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding can be difficult, depending on the application area. In text clustering, for example, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and modify the k-means algorithm so that it is adaptable to string vectors.
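The two problems named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the sample documents, the helper names (`tokenize`, `bag_of_words`, `string_vector`), and in particular the definition of a string vector used here (a document's most frequent words in ranked order, truncated to a fixed size) are assumptions made for illustration. The abstract does not define the encoding or the modified k-means algorithm.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "k means clustering groups similar documents",
    "string vectors encode documents as ordered words",
    "sparse numerical vectors have many zero entries",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(corpus):
    """Encode each document as a count vector over the whole vocabulary."""
    vocab = sorted({word for doc in corpus for word in tokenize(doc)})
    vectors = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def string_vector(text, size=3):
    """Assumed encoding: the `size` most frequent words, in ranked order.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return ranked[:size]


vocab, vectors = bag_of_words(docs)

# Dimensionality grows with the vocabulary; most entries are zero.
zeros = sum(vector.count(0) for vector in vectors)
sparsity = zeros / (len(vectors) * len(vocab))

# String vectors keep a small fixed length regardless of vocabulary size.
string_vectors = [string_vector(doc) for doc in docs]

print("numerical dimensions:", len(vocab))
print("fraction of zero entries:", round(sparsity, 2))
print("string vectors:", string_vectors)
```

Even on three short sentences, each numerical vector has one dimension per distinct word in the corpus and is mostly zeros, while each string vector stays at three elements. Clustering such vectors would need a similarity measure defined on words rather than on numbers, which this sketch does not attempt.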
Pages: 67-76
Page count: 10