Evaluation of distributed data processing frameworks in hybrid clouds

Cited by: 3

Authors:

Ullah, Faheem [1]

Dhingra, Shagun [1]

Xia, Xiaoyu [2]

Babar, M. Ali [1]

Affiliations:

[1] Univ Adelaide, Ctr Res Engn Software Technol CREST Lab, Sch Comp Sci, Adelaide, SA, Australia

[2] RMIT Univ, Sch Comp Technol, Melbourne, Australia

Source:

JOURNAL OF NETWORK AND COMPUTER APPLICATIONS | 2024 / Vol. 224

Keywords:

Hybrid cloud; Hadoop; Spark; Flink; Big data

DOI:

10.1016/j.jnca.2024.103837

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among the computing nodes of a cloud. Recently, there have been increasing efforts to evaluate the performance of distributed data processing frameworks hosted in private and public clouds. However, there is a paucity of research on the performance of these frameworks when hosted in a hybrid cloud, an emerging cloud model that integrates private and public clouds to use the best of both worlds. Therefore, in this paper, we evaluate the performance of Hadoop, Spark, and Flink in a hybrid cloud in terms of execution time, resource utilization, horizontal scalability, vertical scalability, and cost. For this study, our hybrid cloud consists of OpenStack (private cloud) and MS Azure (public cloud). We use both batch and iterative workloads for the evaluation. Our results show that in a hybrid cloud (i) the execution time increases as the private cloud borrows more nodes from the public cloud, (ii) Flink outperforms Spark, which in turn outperforms Hadoop, in terms of execution time, (iii) Hadoop transfers the largest amount of data among the nodes during workload execution, while Spark transfers the least, (iv) all three frameworks scale better horizontally than vertically, and (v) Spark is the least expensive in terms of monetary cost of data processing, while Hadoop is the most expensive.


Pages: 14
