An Improved K-means Algorithm based on Mapreduce and Grid

Cited by: 8
Authors
Ma, Li [1 ,2 ,3 ]
Gu, Lei [1 ,2 ]
Li, Bo [1 ,4 ]
Ma, Yue [5 ]
Wang, Jin [1 ,2 ,3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Jiangsu Engn Ctr Network Monitoring, Nanjing 210044, Jiangsu, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing 210044, Jiangsu, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Key Lab Meteorol Disaster, Minist Educ, Nanjing 210044, Jiangsu, Peoples R China
[4] CMA Res Ctr Strateg Dev, Beijing 100081, Peoples R China
[5] Nanjing Univ Informat Sci & Technol, Sch Math & Stat, Nanjing 210044, Jiangsu, Peoples R China
Source
INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING | 2015, Vol. 8, No. 1
Keywords
Cluster analysis; K-means; Grid; DBSCAN; MapReduce;
DOI
10.14257/ijgdc.2015.8.1.18
Chinese Library Classification (CLC)
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
In the traditional K-means clustering algorithm, the number of clusters K is difficult to set in advance and the initial cluster centers are chosen at random, which makes the clustering results unstable; the algorithm is also sensitive to noise points. To address these problems, the traditional K-means algorithm is improved as follows. The data space is divided into equally sized grid cells, and each data point is assigned to the cell corresponding to its attribute values; the number of points falling into each cell is then counted. The M (M > K) cells containing the most data points are selected and the central point of each is computed. These M central points serve as the input data, and the value of K is determined from the clustering result over them. Among the M points, the K points farthest from each other are chosen as the initial cluster centers of the K-means algorithm, and the centroid of the densest cell (the maximum among the M) is always included among the K. If a cell contains fewer points than a given threshold, its points are treated as noise and removed. To enable the improved algorithm to handle large-scale data, it is parallelized within the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means algorithm, the improved algorithm produces higher-quality clusters, requires fewer iterations, and is more stable, while the parallelized version processes data efficiently and exhibits good scalability and speedup.
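The following is a minimal, hypothetical Python sketch of the grid-based center initialization summarized in the abstract; it is not the authors' implementation. The function name grid_initial_centers and the parameters cells_per_dim, M, K, and noise_threshold are illustrative assumptions, and NumPy is assumed to be available. The MapReduce step (mappers assigning points to the nearest center, reducers recomputing centroids) is omitted here.

import numpy as np

def grid_initial_centers(points, cells_per_dim=10, M=20, K=5, noise_threshold=3):
    # Sketch only: pick K well-separated initial centers from the M densest grid cells.
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)

    # Assign each point to a grid cell according to its attribute values.
    cell_ids = np.floor((points - lo) / span * (cells_per_dim - 1e-9)).astype(int)
    cells = {}
    for idx, cid in enumerate(map(tuple, cell_ids)):
        cells.setdefault(cid, []).append(idx)

    # Cells with fewer points than the threshold are treated as noise and dropped.
    dense = [ids for ids in cells.values() if len(ids) >= noise_threshold]

    # Keep the M densest cells and compute the centroid of each.
    top = sorted(dense, key=len, reverse=True)[:M]
    centroids = np.array([points[ids].mean(axis=0) for ids in top])

    # Greedy farthest-point selection, starting from the densest cell's centroid
    # so that the maximum of the M cells is always among the K chosen centers.
    chosen = [0]
    while len(chosen) < min(K, len(centroids)):
        dist = np.linalg.norm(centroids[:, None, :] - centroids[chosen][None, :, :], axis=2)
        nearest = dist.min(axis=1)   # distance to the closest already-chosen center
        nearest[chosen] = -1.0       # never re-pick an already-chosen centroid
        chosen.append(int(nearest.argmax()))
    return centroids[chosen]

# Example use (hypothetical): seed a standard K-means run with these centers, e.g.
#   centers = grid_initial_centers(X, cells_per_dim=20, M=30, K=5)
#   KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)   # scikit-learn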
Pages: 189-199
Page count: 11