A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 31
Authors
Sinha, Ankita [1]
Jana, Prasanta K. [1]
Affiliations
[1] IIT ISM, Dept Comp Sci & Engn, Dhanbad, Bihar, India
Keywords
Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm; Big data
DOI
10.1007/s11227-017-2182-8
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms handle large solution spaces effectively and can provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, the GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in the GA; it accounts for the covariance between data points and thus provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means, and scaling GA.
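The abstract names two building blocks without giving their details: Mahalanobis distance (used in the GA fitness in phase 1) and k-means++ seeding (used in phase 2, per the keyword list). The following is a minimal standard-library Python sketch of those two pieces only, not the authors' implementation. All function names (`covariance`, `solve`, `mahalanobis`, `kmeanspp_init`) are hypothetical, the squared Euclidean distance in the seeding step is the textbook k-means++ formulation and is assumed here, and neither the GA operators nor the Hadoop MapReduce wiring is shown.

```python
import random


def covariance(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^{-1} (x - y)), computed without forming the inverse."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, diff)  # z = cov^{-1} (x - y)
    return sum(d * zi for d, zi in zip(diff, z)) ** 0.5


def kmeanspp_init(points, k, rng):
    """Pick k seeds; each new seed is drawn with probability proportional
    to its squared Euclidean distance from the nearest seed chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (9.0, 9.0)]
    cov = covariance(data)
    print(mahalanobis(data[0], data[4], cov))
    print(kmeanspp_init(data, 2, random.Random(0)))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check for the sketch.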
Pages: 1562 - 1579
Page count: 18
Related papers
50 in total
  • [31] Evolutionary Computing Assisted K-Means Clustering based MapReduce Distributed Computing Environment for IoT-Driven Smart City
    Srinivas, Kunal G.
    Hosahalli, Doreswamy
    2021 IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION, AND INTELLIGENT SYSTEMS (ICCCIS), 2021, : 192 - 200
  • [32] A Novel Genetic Algorithm Based k-means Algorithm for Cluster Analysis
    El-Shorbagy, M. A.
    Ayoub, A. Y.
    El-Desoky, I. M.
    Mousa, A. A.
    INTERNATIONAL CONFERENCE ON ADVANCED MACHINE LEARNING TECHNOLOGIES AND APPLICATIONS (AMLTA2018), 2018, 723 : 92 - 101
  • [33] A Coloured Image Watermarking Based on Genetic K-Means Clustering Methodology
    Hassan, Zainab Falah
    Al-Shareefi, Farah
    Gheni, Hadeel Qasem
    JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2023, 14 (02) : 242 - 249
  • [34] Enhanced Data Lake Clustering Design based on K-means Algorithm
    Kachaoui, Jabrane
    Belangour, Abdessamad
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (04) : 547 - 554
  • [35] SMK-means: An Improved Mini Batch K-means Algorithm Based on Mapreduce with Big Data
    Xiao, Bo
    Wang, Zhen
    Liu, Qi
    Liu, Xiaodong
    CMC-COMPUTERS MATERIALS & CONTINUA, 2018, 56 (03) : 365 - 379
  • [36] Context Quantization Based on The Modified Genetic Algorithm with K-means
    Chen, Min
    Chen, Jianhua
    2013 NINTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION (ICNC), 2013, : 434 - 438
  • [37] Efficient adaptive large-scale text clustering method based on genetic K-means algorithm
    Dai, Wenhua
    Jiao, Cuizhen
    He, Tingting
    RECENT ADVANCE OF CHINESE COMPUTING TECHNOLOGIES, 2007, : 281 - 285
  • [38] Improving Performance of K-Means Clustering by Initializing Cluster Centers Using Genetic Algorithm and Entropy Based Fuzzy Clustering for Categorization of Diabetic Patients
    Karegowda, Asha Gowda
    Shama, Vidya T.
    Jayaram, M. A.
    Manjunath, A. S.
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, 2013, 174 : 899 - 904
  • [39] Cloud Based K-Means Clustering Running as a MapReduce Job for Big Data Healthcare Analytics Using Apache Mahout
    Rallapalli, Sreekanth
    Gondkar, R. R.
    Rao, Golajapu Venu Madhava
    INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS, VOL 1, INDIA 2016, 2016, 433 : 127 - 135
  • [40] ADAPTIVE K-MEANS ALGORITHM FOR OVERLAPPED GRAPH CLUSTERING
    Bello-Orgaz, Gema
    Menendez, Hector D.
    Camacho, David
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2012, 22 (05)