A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 0
Authors
Ankita Sinha
Prasanta K. Jana
Affiliations
[1] Department of Computer Science and Engineering, IIT (ISM), Dhanbad
Source
The Journal of Supercomputing | 2018 / Vol. 74
Keywords
Mahalanobis distance; Apache Hadoop; K-means++ initialization; Genetic algorithm
DOI
Not available
Abstract
Clustering a large volume of data in a distributed environment is a challenging issue. Data stored across multiple machines are huge in size, and the solution space is large. A genetic algorithm deals effectively with a large solution space and provides a better solution. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, the GA is applied in parallel to data chunks located across different machines. Mahalanobis distance is used as the fitness value in the GA; it considers the covariance between data points and thus provides a better representation of the initial data. In phase 2, K-means with K-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
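The two building blocks named in the abstract, Mahalanobis distance and K-means++ initialization, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not say how the GA fitness is computed from the distance or how the work is split across Hadoop map and reduce tasks, so neither is shown. The function names (`covariance_2d`, `mahalanobis_2d`, `kmeans_pp_init`) are hypothetical, the data are assumed to be 2-D so that the covariance matrix can be inverted in closed form, and the seeding step uses squared Euclidean distance, as in standard K-means++.

```python
import math
import random


def covariance_2d(points):
    """Sample covariance matrix (2x2) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance sqrt((p - q)^T S^-1 (p - q)) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(max(quad, 0.0))  # clamp tiny negative rounding errors


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: first center uniform, the rest with
    probability proportional to squared distance to the nearest center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [
            min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
            for p in points
        ]
        if sum(d2) == 0:
            # Every point coincides with an existing center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (10.0, 10.0)]
    cov = covariance_2d(data)
    print(mahalanobis_2d(data[0], data[4], cov))
    print(kmeans_pp_init(data, 2, random.Random(0)))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is a convenient sanity check. Unequal variances or correlated features change the distance, and that covariance-awareness is the property the abstract credits for its use in the GA.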
Pages: 1562 - 1579
Page count: 17
Related papers
50 in total
  • [31] Combining K-MEANS and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering
    Islam, Md Zahidul
    Estivill-Castro, Vladimir
    Rahman, Md Anisur
    Bossomaier, Terry
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2018, 91 : 402 - 417
  • [32] Mahalanobis Distance Based K-Means Clustering
    Brown, Paul O.
    Chiang, Meng Ching
    Guo, Shiqing
    Jin, Yingzi
    Leung, Carson K.
    Murray, Evan L.
    Pazdor, Adam G. M.
    Cuzzocrea, Alfredo
    [J]. BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2022, 2022, 13428 : 256 - 262
  • [33] An Effective Hybrid Method Based on DE, GA, and K-means for Data Clustering
    Prakash, Jay
    Singh, Pramod Kumar
    [J]. PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON SOFT COMPUTING FOR PROBLEM SOLVING (SOCPROS 2012), 2014, 236 : 1561 - 1572
  • [34] ARRHYTHMIA DISEASE DIAGNOSIS USING NEURAL NETWORK, SVM, AND GENETIC ALGORITHM-OPTIMIZED k-MEANS CLUSTERING
    Martis, Roshan Joy
    Chakraborty, Chandan
    [J]. JOURNAL OF MECHANICS IN MEDICINE AND BIOLOGY, 2011, 11 (04) : 897 - 915
  • [35] An adaptive and opposite K-means operation based memetic algorithm for data clustering
    Wang, Xi
    Wang, Zidong
    Sheng, Mengmeng
    Li, Qi
    Sheng, Weiguo
    [J]. NEUROCOMPUTING, 2021, 437 : 131 - 142
  • [36] Multi-Mode Active Suspension Control Based on a Genetic K-Means Clustering Linear Quadratic Algorithm
    Wu, Kun
    Liu, Jiang
    Li, Min
    Liu, Jianze
    Wang, Yushun
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [37] A hybrid genetic-fuzzy ant colony optimization algorithm for automatic K-means clustering in urban global positioning system
    Ran, Xiaojuan
    Suyaroj, Naret
    Tepsan, Worawit
    Ma, Jianghong
    Zhou, Xiangbing
    Deng, Wu
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137
  • [39] Genetic weighted k-means algorithm for clustering large-scale gene expression data
    Wu, Fang-Xiang
    [J]. BMC BIOINFORMATICS, 2008, 9 (Suppl 6)
  • [40] Genetic weighted k-means algorithm for clustering large-scale gene expression data
    Fang-Xiang Wu
    [J]. BMC Bioinformatics, 9