A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

被引:32
作者
Sinha, Ankita [1 ]
Jana, Prasanta K. [1 ]
机构
[1] IIT ISM, Dept Comp Sci & Engn, Dhanbad, Bihar, India
关键词
Mahalanobis distance; Apache Hadoop; k-means plus plus initialization; Genetic algorithm; BIG DATA;
D O I
10.1007/s11227-017-2182-8
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Clustering a large volume of data in a distributed environment is a challenging issue. Data stored across multiple machines are huge in size, and solution space is large. Genetic algorithm deals effectively with larger solution space and provides better solution. In this paper, we proposed a novel clustering algorithm for distributed datasets, using combination of genetic algorithm (GA) with Mahalanobis distance and k-means clustering algorithm. The proposed algorithm is two phased; in phase 1, GA is applied in parallel on data chunks located across different machines. Mahalanobis distance is used as fitness value in GA, which considers covariance between the data points and thus provides a better representation of initial data. K-means with K-means initialization is applied in phase 2 on intermediate output to get final result. The proposed algorithm is implemented on Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted for multiple real-life and synthetic datasets to measure performance of our proposed algorithm. Results were compared with MapReduce-based algorithms, mrk-means, parallel k-means and scaling GA.
引用
收藏
页码:1562 / 1579
页数:18
相关论文
共 34 条
[1]   NP-hardness of Euclidean sum-of-squares clustering [J].
Aloise, Daniel ;
Deshpande, Amit ;
Hansen, Pierre ;
Popat, Preyas .
MACHINE LEARNING, 2009, 75 (02) :245-248
[2]  
[Anonymous], INTRO GENETIC ALGORI
[3]  
[Anonymous], 2015, BIG DATA ANAL
[4]  
[Anonymous], 2007, P 18 ANN ACM SIAM S
[5]  
[Anonymous], 2016, UCI MACHINE LEARNING
[6]  
[Anonymous], 2006, Introduction to Data Mining
[7]  
[Anonymous], 2006, GENETIC ALGORITHMS
[8]  
[Anonymous], SIMILARITY MEASUREME
[9]  
[Anonymous], 2001, 3D DATA MANAGEMENT C
[10]  
[Anonymous], CLUSTERING ALGORITHM