Multiple Parallel MapReduce k-means Clustering with Validation and Selection

Citations: 7
Authors
Garcia, Kemilly Dearo [1]
Naldi, Murilo Coelho [1]
Affiliations
[1] UFV, Dept Exact & Technol Sci, Rio Paranaiba, Brazil
Source
2014 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS) | 2014
Keywords
distributed clustering; k-means; MapReduce
DOI
10.1109/BRACIS.2014.83
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Handling large amounts of data is one of the challenges of clustering, and it creates the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results to be combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to the initialization of those clusters. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to select the best result. The proposed algorithm is experimentally compared with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
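The idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Hadoop or Mahout: the map and reduce phases are simulated with plain Python loops, and all function names (`kmeans_step`, `simplified_silhouette`, `random_runs`, `multi_kmeans`) are hypothetical. The abstract does not name the relative validity index, so the sketch assumes a centroid-based simplified silhouette purely for illustration.

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))


def kmeans_step(data, runs):
    """One k-means iteration for ALL runs in a single pass over the data."""
    sums = {}
    counts = defaultdict(int)
    # Simulated map phase: each point emits one (run, cluster) key per run.
    for point in data:
        for run_id, centroids in runs.items():
            key = (run_id, nearest(point, centroids))
            prev = sums.get(key, (0.0,) * len(point))
            sums[key] = tuple(s + x for s, x in zip(prev, point))
            counts[key] += 1
    # Simulated reduce phase: average the points gathered under each key.
    return {
        run_id: [
            tuple(s / counts[(run_id, j)] for s in sums[(run_id, j)])
            if (run_id, j) in sums else c  # an empty cluster keeps its centroid
            for j, c in enumerate(centroids)
        ]
        for run_id, centroids in runs.items()
    }


def simplified_silhouette(data, centroids):
    """Assumed validity index: mean of (b - a) / b over all points, where a is
    the distance to the nearest centroid and b to the second nearest (k >= 2)."""
    total = 0.0
    for point in data:
        d = sorted(dist(point, c) for c in centroids)
        a, b = d[0], d[1]
        total += (b - a) / b if b > 0 else 0.0
    return total / len(data)


def random_runs(data, ks, inits_per_k, seed=0):
    """Initial centroids for every (k, init index) pair, sampled from the data."""
    rng = random.Random(seed)
    return {(k, i): rng.sample(data, k) for k in ks for i in range(inits_per_k)}


def multi_kmeans(data, runs, iterations=10):
    """Run all k-means instances together, then select the best by the index."""
    for _ in range(iterations):
        runs = kmeans_step(data, runs)
    scores = {rid: simplified_silhouette(data, c) for rid, c in runs.items()}
    best = max(scores, key=scores.get)
    return best, runs[best], scores
```

A call such as `multi_kmeans(data, random_runs(data, ks=[2, 3, 4], inits_per_k=5))` would return the identifier `(k, init index)` of the highest-scoring run, its centroids, and the score of every run. Keying the intermediate values by `(run, cluster)` is what lets every run share one pass over the data, which is the property the abstract attributes to running multiple k-means instances in parallel.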
Pages: 432-437
Number of pages: 6