Multiple Parallel MapReduce k-means Clustering with Validation and Selection

Cited by: 7

Authors:

Garcia, Kemilly Dearo ^{[1]}

Naldi, Murilo Coelho ^{[1]}

Affiliations:

[1] UFV, Dept Exact & Technol Sci, Rio Paranaiba, Brazil

Source:

2014 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS) | 2014

Keywords:

distributed clustering; k-means; MapReduce;

DOI:

10.1109/BRACIS.2014.83

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Handling large volumes of data is one of the main challenges for clustering, as it requires huge data sets to be distributed and managed across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results combined seamlessly. k-means is one of the few clustering algorithms that satisfies the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to their initialization. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to find the best result. The proposed algorithm is experimentally compared with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
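
The idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation and does not use Hadoop or Mahout: the map and reduce steps are simulated with plain Python functions, and all names (`map_assign`, `reduce_centroid`, `multi_kmeans`, and so on) are hypothetical. The abstract does not name the validity index, so the simplified silhouette is used here purely as an assumed stand-in for a relative validity index. Each k-means run is identified by a `(k, init)` pair, so one pass over the data updates every run at once.

```python
import math
import random
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    return min(range(len(dists)), key=dists.__getitem__)


def map_assign(point, runs):
    """Map step (simulated): emit one ((run_id, cluster_id), point) pair per run."""
    for run_id, centroids in runs.items():
        yield (run_id, nearest(point, centroids)), point


def reduce_centroid(points):
    """Reduce step (simulated): average all points that share one key."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def kmeans_iteration(data, runs):
    """One pass over the data that updates the centroids of every run."""
    grouped = defaultdict(list)
    for point in data:
        for key, value in map_assign(point, runs):
            grouped[key].append(value)
    # Clusters that received no points keep their previous centroid.
    new_runs = {run_id: list(cs) for run_id, cs in runs.items()}
    for (run_id, cluster_id), points in grouped.items():
        new_runs[run_id][cluster_id] = reduce_centroid(points)
    return new_runs


def simplified_silhouette(data, centroids):
    """Assumed validity index: mean of (b - a) / max(a, b), where a and b are
    the distances to the closest and second-closest centroid (needs k >= 2)."""
    total = 0.0
    for point in data:
        a, b = sorted(math.dist(point, c) for c in centroids)[:2]
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(data)


def multi_kmeans(data, ks, inits_per_k=3, iterations=10, seed=0):
    """Run k-means for several k and initializations, then keep the best run."""
    rng = random.Random(seed)
    runs = {(k, i): rng.sample(data, k) for k in ks for i in range(inits_per_k)}
    for _ in range(iterations):
        runs = kmeans_iteration(data, runs)
    scores = {rid: simplified_silhouette(data, cs) for rid, cs in runs.items()}
    best = max(scores, key=scores.get)
    return best, runs[best], scores[best]
```

Calling `multi_kmeans(points, ks=[2, 3, 4])` on a list of coordinate tuples returns the winning `(k, init)` identifier, its centroids, and its score. Keying the intermediate pairs by run and cluster is one way to let a single scan of the data serve all runs; whether the paper organizes its jobs this way is not stated in the abstract.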


Pages: 432-437

Page count: 6

Related literature (50 items in total)

[41] PSO Aided k-Means Clustering: Introducing Connectivity in k-Means
Breaban, Mihaela Elena
Luchian, Henri
GECCO-2011: PROCEEDINGS OF THE 13TH ANNUAL GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE, 2011, : 1227 - 1234
[42] Performance Analysis of Parallel K-Means with Optimization Algorithms for Clustering on Spark
Santhi, V.
Jose, Rini
DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY (ICDCIT 2018), 2018, 10722 : 158 - 162
[43] A Parallel Forecasting Approach Using Incremental K-means Clustering Technique
Sahoo, Swagatika
COMPUTATIONAL INTELLIGENCE IN DATA MINING, CIDM 2016, 2017, 556 : 165 - 172
[44] Scalable Fast Evolutionary k-means Clustering
de Oliveira, Gilberto Viana
Naldi, Murilo Coelho
2015 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS 2015), 2015, : 74 - 79
[45] External validation measures for K-means clustering: A data distribution perspective
Wu, Junjie
Chen, Jian
Xiong, Hui
Xie, Ming
EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (03) : 6050 - 6061
[46] Balanced seed selection for K-means clustering with determinantal point process
Bajpai, Namita
Paik, Jiaul H.
Sarkar, Sudeshna
PATTERN RECOGNITION, 2025, 164
[47] K-Means Clustering Efficient Algorithm with Initial Class Center Selection
Huang Suyu
Hu Pingfang
PROCEEDINGS OF THE 2018 3RD INTERNATIONAL WORKSHOP ON MATERIALS ENGINEERING AND COMPUTER SCIENCES (IWMECS 2018), 2018, 78 : 301 - 305
[48] Initial Centroid Selection Method for an Enhanced K-means Clustering Algorithm
Aamer, Youssef
Benkaouz, Yahya
Ouzzif, Mohammed
Bouragba, Khalid
UBIQUITOUS NETWORKING, UNET 2019, 2020, 12293 : 182 - 190
[49] Parallel Two-Phase K-Means
Cuong Duc Nguyen
Dung Tien Nguyen
Van-Hau Pham
COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2013, PT V, 2013, 7975 : 224 - 231
[50] RETRACTED ARTICLE: Innovative study on clustering center and distance measurement of K-means algorithm: mapreduce efficient parallel algorithm based on user data of JD mall
Yang Liu
Xinxin Du
Shuaifeng Ma
Electronic Commerce Research, 2023, 23 : 43 - 73
