Competitive K-means

Cited by: 32
Authors
Esteves, Rui Maximo [1]
Hacker, Thomas [2]
Rong, Chunming [1]
Affiliations
[1] Univ Stavanger, Dept Elect & Comp Engn, Stavanger, Norway
[2] Purdue Univ, Comp & Informat Technol, W Lafayette, IN 47907 USA
Source
2013 IEEE FIFTH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING TECHNOLOGY AND SCIENCE (CLOUDCOM), VOL 1 | 2013
Keywords
K-means; K-means++; Streaming K-means; MapReduce
DOI
10.1109/CloudCom.2013.89
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyze large datasets. Cluster analysis techniques, such as K-means, can be used for large datasets distributed across several machines. The accuracy of K-means depends on the selection of seed centroids during initialization. K-means++ improves on the K-means seeder, but suffers from problems when it is applied to large datasets: (a) the random algorithm it employs can produce inconsistent results across several analysis runs under the same initial conditions; and (b) it scales poorly for large datasets. In this paper we describe a new Competitive K-means algorithm that addresses both of these problems. We describe an efficient MapReduce implementation of Competitive K-means that we found scales well with large datasets. We compared the performance of our new algorithm with three existing cluster analysis algorithms and found that it improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 +/- 9 times compared with serial K-means++ and is as fast as Streaming K-means. Our work provides a method to select a good initial seeding in less time, facilitating accurate cluster analysis over large datasets.
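To make problem (a) in the abstract concrete, the following is a minimal sketch of the standard K-means++ seeder (D² sampling), not the Competitive K-means algorithm itself, whose details are not given in this record. The function names, the synthetic four-blob dataset, and the choice of ten runs are all illustrative assumptions. It seeds the same data with different random states and reports how much the seeding cost varies between runs.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """Standard K-means++ seeding: pick k initial centroids by D^2 sampling."""
    # The first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest centroid so far.
    d2 = [sq_dist(p, centroids[0]) for p in points]
    while len(centroids) < k:
        # Draw the next centroid with probability proportional to d2.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centroids.append(points[idx])
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return centroids


def seeding_cost(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def make_blobs(rng, centers, n_per_center, spread):
    """Hypothetical helper: Gaussian blobs around the given centers."""
    return [
        tuple(c + rng.gauss(0, spread) for c in center)
        for center in centers
        for _ in range(n_per_center)
    ]


# Illustrative data: four 2-D blobs, 50 points each (an assumption, not the paper's data).
points = make_blobs(random.Random(0), [(0, 0), (10, 0), (0, 10), (10, 10)], 50, 1.0)

# Same data, same k, ten different random states: the seeding cost can differ per run.
costs = [
    seeding_cost(points, kmeanspp_seed(points, 4, random.Random(seed)))
    for seed in range(10)
]
print("min cost:", min(costs), "max cost:", max(costs))
```

The spread between the minimum and maximum cost gives a rough picture of the run-to-run inconsistency the abstract refers to. Fixing the random state makes a single run reproducible, but it does not remove the dependence of the result on that choice.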
Pages: 17-24
Number of pages: 8
Related papers
50 in total
  • [41] Research and Improve on K-means Algorithm Based on Hadoop
    Wu, Kehe
    Zeng, Wenjing
    Wu, Tingting
    An, Yanwen
    PROCEEDINGS OF 2015 6TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE, 2015, : 334 - 337
  • [42] Improving k-means through distributed scalable metaheuristics
    Oliveira, G. V.
    Coutinho, F. P.
    Campello, R. J. G. B.
    Naldi, M. C.
    NEUROCOMPUTING, 2017, 246 : 45 - 57
  • [43] Improved MapReduce k-Means Clustering Algorithm with Combiner
    Anchalia, Prajesh P.
    2014 UKSIM-AMSS 16TH INTERNATIONAL CONFERENCE ON COMPUTER MODELLING AND SIMULATION (UKSIM), 2014, : 386 - 391
  • [44] Performance evaluation of K-means clustering on Hadoop infrastructure
    Vats, Satvik
    Sagar, B. B.
    JOURNAL OF DISCRETE MATHEMATICAL SCIENCES & CRYPTOGRAPHY, 2019, 22 (08) : 1349 - 1363
  • [45] An Improved K-means Algorithm based on Mapreduce and Grid
    Ma, Li
    Gu, Lei
    Li, Bo
    Ma, Yue
    Wang, Jin
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2015, 8 (01) : 189 - 199
  • [46] An efficient approximation to the K-means clustering for massive data
    Capo, Marco
    Perez, Aritz
    Lozano, Jose A.
    KNOWLEDGE-BASED SYSTEMS, 2017, 117 : 56 - 69
  • [47] A MapReduce framework to implement Enhanced K-means algorithm
    Purohit, Bhimasen. V.
    Shettar, Rajashree
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON APPLIED AND THEORETICAL COMPUTING AND COMMUNICATION TECHNOLOGY (ICATCCT), 2015, : 361 - 363
  • [48] A MapReduce-based K-means clustering algorithm
    Mao, YiMin
    Gan, DeJin
    Mwakapesa, D. S.
    Nanehkaran, Y. A.
    Tao, Tao
    Huang, XueYu
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (04) : 5181 - 5202
  • [49] Research on Improved K-Means Algorithm Based on Hadoop
    Wei Xiaojing
    Li Yuanbo
    2017 4TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE), 2017, : 593 - 598
  • [50] t-k-means: A ROBUST AND STABLE k-means VARIANT
    Li, Yiming
    Zhang, Yang
    Tang, Qingtao
    Huang, Weipeng
    Jiang, Yong
    Xia, Shu-Tao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3120 - 3124