A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 31
Authors
Sinha, Ankita [1]
Jana, Prasanta K. [1]
Affiliations
[1] IIT ISM, Dept Comp Sci & Engn, Dhanbad, Bihar, India
Keywords
Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm; Big data
DOI
10.1007/s11227-017-2182-8
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and can provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in GA; because it accounts for the covariance between data points, it provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
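The two-phase flow described in the abstract can be illustrated with a small single-process sketch. It is not the authors' Hadoop implementation. Several details are assumptions made only for illustration: points are 2-D so the covariance inverse has a closed form, a chromosome is a list of k point indices, the fitness is the total Mahalanobis distance from each point to its nearest candidate center (lower is better), and the GA uses truncation selection, one-point crossover and a 0.2 mutation rate. All function names (`ga_chunk`, `two_phase`, and so on) are hypothetical.

```python
import math
import random

def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """sqrt((p-q)^T S^-1 (p-q)) using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate chunk)
    dx, dy = p[0] - q[0], p[1] - q[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(0.0, quad))  # clamp tiny negative rounding errors

def fitness(chrom, points, cov):
    """Assumed fitness: summed Mahalanobis distance to the nearest chosen center."""
    return sum(min(mahalanobis_2d(p, points[i], cov) for i in chrom) for p in points)

def ga_chunk(points, k, rng, pop_size=20, generations=30):
    """Phase 1 (one chunk): evolve k candidate centers drawn from the chunk."""
    cov = covariance_2d(points)
    n = len(points)
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, points, cov))
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k) if k > 1 else 0
            child = a[:cut] + b[cut:]             # one-point crossover
            if rng.random() < 0.2:                # mutation: swap in a random point
                child[rng.randrange(k)] = rng.randrange(n)
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda c: fitness(c, points, cov))
    return [points[i] for i in best]

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: later seeds are drawn with probability ~ squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:                     # all points coincide with a seed
            weights = None
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans(points, k, rng, iters=20):
    """Phase 2: Lloyd iterations starting from k-means++ seeds."""
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def two_phase(chunks, k, seed=0):
    rng = random.Random(seed)
    intermediate = []
    for chunk in chunks:  # on Hadoop, each chunk would be processed by its own mapper
        intermediate.extend(ga_chunk(chunk, k, rng))
    return kmeans(intermediate, k, rng)  # run on the merged intermediate output

if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
    data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(90)]
    rng.shuffle(data)
    chunks = [data[i::3] for i in range(3)]  # three simulated machines
    print(sorted(two_phase(chunks, k=2)))
```

In this sketch the loop over chunks stands in for the parallel map stage, and the final `kmeans` call stands in for the step that consumes the intermediate output. With an identity covariance, `mahalanobis_2d` reduces to the Euclidean distance, which is a convenient sanity check.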
Pages: 1562-1579
Page count: 18