An Improved K-means Algorithm Using Modified Cosine Distance Measure for Document Clustering Using Mahout with Hadoop

Cited by: 0
Authors
Sahu, Lokesh [1]
Mohan, Biju R. [1]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Informat Technol, Surathkal, Karnataka, India
Source
2014 9TH INTERNATIONAL CONFERENCE ON INDUSTRIAL AND INFORMATION SYSTEMS (ICIIS) | 2014
Keywords
Document Clustering; K-means; Hadoop; Mahout;
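DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812
Abstract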
In this paper, we propose a novel K-means algorithm with a modified Cosine Distance Measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the Cosine Distance Measure used to compute similarity between objects in order to improve cluster quality. Our method computes the Cosine Distance between two objects and then pulls them closer by squaring the distance if it lies between 0 and 0.5; otherwise it increases the distance. This reduces intra-cluster distance and increases inter-cluster distance. We measure cluster quality in terms of inter-cluster and intra-cluster distances, good feature weighting such as TF-IDF, cluster size, and the top terms of each cluster. We compare K-means under the standard and modified Cosine Distance Measures using performance metrics such as inter-cluster and intra-cluster distance, cluster size, and execution time. Our experimental results show a 0.016% reduction in intra-cluster distance, a 0.012% increase in inter-cluster distance, a 1.5% reduction in cluster size, and a 4% reduction in sequence file size, which together indicate better cluster quality.
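The distance rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Mahout implementation. The abstract only says that distances above 0.5 are increased, without saying how, so the square-root default below is a hypothetical choice, as are the function names, the inclusive treatment of the 0.5 boundary, and the assumption of non-negative (for example TF-IDF) vectors, which keeps the cosine distance in [0, 1].

```python
import math
from typing import Callable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def modified_cosine_distance(
    a: Sequence[float],
    b: Sequence[float],
    increase: Callable[[float], float] = math.sqrt,  # hypothetical choice
) -> float:
    """Square small distances (0 to 0.5); enlarge the rest.

    Assumes non-negative vectors, so the distance lies in [0, 1].
    On (0.5, 1), sqrt(d) > d, so the default does increase the distance.
    """
    d = cosine_distance(a, b)
    # Clamp to [0, 1] to absorb floating-point round-off.
    d = min(1.0, max(0.0, d))
    if d <= 0.5:
        return d * d  # squaring shrinks values in [0, 0.5]
    return increase(d)
```

With this sketch, two vectors at a cosine distance of about 0.29 end up at about 0.086, while a pair at about 0.68 is pushed out to about 0.83, which matches the qualitative effect the abstract describes: near neighbours move closer and distant pairs move further apart.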
Pages: 1048-1052
Page count: 5