A Performance Comparison of Big Data Processing Platform Based on Parallel Clustering Algorithms

Cited by: 4
Authors
Hai, Mo [1, 2]
Zhang, Yuejing [1]
Li, Haifeng [1]
Affiliations
[1] Cent Univ Finance & Econ, Sch Informat, Beijing 100081, Peoples R China
[2] Univ Elect Sci & Technol China, Network & Data Secur Key Lab Sichuan Prov, Chengdu 610054, Sichuan, Peoples R China
Source
6TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND QUANTITATIVE MANAGEMENT | 2018 / Vol. 139
Keywords
Hadoop; Spark; DataMPI; K-means; fuzzy K-means; Canopy; MapReduce
DOI
10.1016/j.procs.2018.10.228
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The performance of three typical big data processing platforms, Hadoop, Spark and DataMPI, is compared using three parallel clustering algorithms: parallel K-means, parallel fuzzy K-means and parallel Canopy. Experiments are performed on text and numeric datasets and on clusters of different scales. The results show that: (1) for the same dataset, when each node has 4 GB of memory, DataMPI achieves about a 60% performance improvement over Hadoop and about a 32% performance improvement over Spark; (2) to obtain high clustering performance, a cluster of 6 nodes with 6 GB of memory per node should be selected. © 2018 The Authors. Published by Elsevier B.V.
Pages: 127-135
Number of pages: 9