Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments

Cited by: 57

Authors:

Islam, Muhammed Tawfiqul [1]

Karunasekera, Shanika [1]

Buyya, Rajkumar [1]

Affiliations:

[1] Univ Melbourne, Sch Comp & Informat Syst, Cloud Comp & Distributed Syst CLOUDS, Melbourne, Vic 3010, Australia

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2022 / Vol. 33 / No. 07

Funding:

Australian Research Council;

Keywords:

Sparks; Cloud computing; Costs; Task analysis; Service level agreements; Big Data; Reinforcement learning; cost-efficiency; performance improvement; deep reinforcement learning;

DOI:

10.1109/TPDS.2021.3124670

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. The cloud offers affordable compute resources that are easier to manage. Hence, many organizations are shifting towards a cloud deployment of their big data computing clusters. However, job scheduling is a complex problem in the presence of various Service Level Agreement (SLA) objectives such as monetary cost reduction and job performance improvement. Most of the existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level to place the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can also learn the inherent characteristics of different types of jobs to find a proper placement that reduces both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost by up to 30%.
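
The two objectives named in the abstract (total VM usage cost and average job duration) can be combined into one scalar reward signal for an RL agent. The sketch below is only an illustration of that idea, not the reward function used in the paper: the `VM` class, the `vm_usage_cost` and `reward` helpers, the prices, the normalisation bounds, and the weight `beta` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """A cloud VM instance type (hypothetical fields, for illustration only)."""
    name: str
    price_per_hour: float  # assumed on-demand price, not taken from the paper


def vm_usage_cost(vms, busy_hours):
    """Sum the cost of each VM over the hours it hosted at least one executor.

    VMs absent from `busy_hours` are treated as unused and cost nothing.
    """
    return sum(vm.price_per_hour * busy_hours.get(vm.name, 0.0) for vm in vms)


def reward(cost, avg_duration, max_cost, max_duration, beta=0.5):
    """Return a reward in [-1, 0]; values closer to 0 are better.

    `beta` weights cost against duration (an assumed trade-off knob).
    Both terms are normalised by caller-supplied upper bounds.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    norm_cost = min(cost / max_cost, 1.0)
    norm_duration = min(avg_duration / max_duration, 1.0)
    return -(beta * norm_cost + (1.0 - beta) * norm_duration)


# Example with made-up numbers: two VMs, one used for 2 h and one for 1 h.
vms = [VM("small", 0.10), VM("large", 0.40)]
cost = vm_usage_cost(vms, {"small": 2.0, "large": 1.0})
r = reward(cost, avg_duration=30.0, max_cost=1.2, max_duration=60.0)
```

Setting `beta` near 1 would favour placements that pack executors onto fewer or cheaper VMs, while a value near 0 would favour shorter job durations; how the paper actually balances the two is not stated in this record.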

Citation

Pages: 1695 - 1710

Page count: 16
