Chinese text clustering algorithm based k-means

Cited by: 15
Authors
Yao, Mingyu [1]
Pi, Dechang [1]
Cong, Xiangxiang [2]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Informat Sci & Technol, Nanjing 210016, Jiangsu, Peoples R China
[2] E China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai 200237, Peoples R China
Source
2012 INTERNATIONAL CONFERENCE ON MEDICAL PHYSICS AND BIOMEDICAL ENGINEERING (ICMPBE2012) | 2012 / Vol. 33
Keywords
text cluster; k-means; Chinese text;
DOI
10.1016/j.phpro.2012.05.066
Chinese Library Classification number
Q6 [Biophysics];
Discipline classification code
071011;
Abstract
Text clustering is an important method in text mining. This paper focuses on the process of Chinese text clustering based on k-means. Experiments showed that the new center of a cluster is easily affected by isolated texts. The average similarity of a cluster is therefore used as a parameter and multiplied by a coefficient between 0.75 and 1.25 to obtain a similarity threshold. The texts whose similarity to the original cluster center is greater than or equal to the threshold are collected as a candidate set, and the cluster center is then updated with the center of the candidate set. The experiments show that the improved method increases purity and F value by about 10 percent on average over the original method. (C) 2012 Published by Elsevier B.V. Selection and/or peer review under responsibility of ICMPBE International Committee.
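The center update described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes texts are already represented as term-weight vectors, that similarity is cosine similarity, and that the "average similarity" of a cluster is the mean similarity of its members to the original center; the abstract specifies none of these. The names cosine, mean_vector, and update_center, and the fallback for an empty candidate set, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def update_center(members, old_center, coefficient=1.0):
    """Recompute a cluster center from the members closest to the old center.

    coefficient is the multiplier from the abstract, assumed to lie in [0.75, 1.25].
    """
    if not members:
        return list(old_center)
    sims = [cosine(doc, old_center) for doc in members]
    # Threshold = coefficient * average similarity to the original center (assumed definition).
    threshold = coefficient * (sum(sims) / len(sims))
    candidates = [doc for doc, s in zip(members, sims) if s >= threshold]
    # Assumption: if no text passes the threshold, fall back to the plain k-means mean.
    if not candidates:
        candidates = members
    return mean_vector(candidates)


# Two similar texts and one isolated text assigned to the same cluster.
members = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
old_center = [1.0, 0.0]
new_center = update_center(members, old_center, coefficient=1.0)
plain_center = mean_vector(members)
```

In this toy example the isolated text [0.0, 1.0] falls below the threshold, so new_center is computed from the first two texts only, while plain_center (the ordinary k-means mean) is pulled toward the outlier.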
Pages: 301-307
Number of pages: 7