Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments

Cited by: 55

Authors:

Islam, Muhammed Tawfiqul [1]

Karunasekera, Shanika [1]

Buyya, Rajkumar [1]

Affiliation:

[1] Univ Melbourne, Sch Comp & Informat Syst, Cloud Comp & Distributed Syst CLOUDS, Melbourne, Vic 3010, Australia

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2022, Vol. 33, No. 7

Funding:

Australian Research Council

Keywords:

Sparks; Cloud computing; Costs; Task analysis; Service level agreements; Big Data; Reinforcement learning; cost-efficiency; performance improvement; deep reinforcement learning

DOI:

10.1109/TPDS.2021.3124670

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Subject classification code:

081202

Abstract:

Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. The cloud offers affordable compute resources that are easier to manage, so many organizations are shifting their big data computing clusters to cloud deployments. However, job scheduling is a complex problem in the presence of various Service Level Agreement (SLA) objectives, such as monetary cost reduction and job performance improvement. Most existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level to place the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can learn the inherent characteristics of different types of jobs to find a placement that reduces both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost by up to 30%.

Citation

Pages: 1695-1710

Page count: 16

