An improved kNN text categorization algorithm based on cluster distribution

Cited by: 0

Authors:

Luo, Yuansheng [1,2]

Wang, Minweng [2,3]

Le, Zhongjian [2]

Zhang, Huawei [1]

Affiliations:

[1] Modern Education Technology Center, Jiangxi University of Finance and Economics, Nanchang 330013, China

[2] School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China

[3] School of Computer Information and Engineering, Jiangxi Normal University, Nanchang 330022, China

Source:

Journal of Computational Information Systems | 2012 / Vol. 8 / No. 3

Keywords:

Nearest neighbor search; Text processing; Learning algorithms; Sampling; Clustering algorithms

DOI:

Not available

CLC number:

Subject classification number:

Abstract:

The traditional kNN text classification algorithm uses all training samples for classification, so its computational cost is very high when the number of training samples is large. To address this problem, the paper proposes an improved kNN text classification algorithm based on cluster distribution. First, the training samples of each category are clustered with the k-means algorithm, and all cluster centers are taken as the new training samples. Second, a weight value is introduced that integrates the contribution of large clusters, the effect of dispersed clusters, and the distribution of the clusters. Finally, the modified samples are used as the training set for kNN text classification. Experiments on the Fudan University text classification corpus and the 20 Newsgroups data set show that the proposed algorithm not only effectively reduces the actual number of training samples and lowers the computational complexity, but also improves the accuracy of the kNN text classification algorithm. 1553-9105/Copyright © 2012 Binary Information Press.
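
The three steps in the abstract (per-category k-means, weighted cluster centers, kNN over the centers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the weight formula, so the weight used here (cluster size divided by one plus the mean member-to-center distance) is an assumed stand-in that only mimics "large clusters count more, dispersed clusters count less". The names `build_prototypes`, `classify`, and `clusters_per_class` are hypothetical, and plain numeric vectors stand in for text feature vectors.

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (center, members) for non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, min(k, len(points)))
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [(c, g) for c, g in zip(centers, groups) if g]


def build_prototypes(train, clusters_per_class=2):
    """Step 1 and 2: cluster each category, keep weighted centers.

    train is a list of (vector, label); returns (center, label, weight) triples.
    """
    by_label = defaultdict(list)
    for vec, label in train:
        by_label[label].append(vec)
    prototypes = []
    for label, vecs in by_label.items():
        for center, members in kmeans(vecs, clusters_per_class):
            spread = sum(dist(m, center) for m in members) / len(members)
            # ASSUMED weight (not from the paper): grows with cluster size,
            # shrinks with dispersion.
            weight = len(members) / (1.0 + spread)
            prototypes.append((center, label, weight))
    return prototypes


def classify(prototypes, query, k=3):
    """Step 3: weighted vote among the k cluster centers nearest to the query."""
    nearest = sorted(prototypes, key=lambda p: dist(p[0], query))[:k]
    votes = defaultdict(float)
    for center, label, weight in nearest:
        votes[label] += weight / (1.0 + dist(center, query))
    return max(votes, key=votes.get)
```

Under these assumptions, classification compares a query against at most `clusters_per_class` centers per category instead of every training sample, which is where the reduction in computation described in the abstract comes from.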


Pages: 1255-1263