Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments

Cited by: 57
Authors
Islam, Muhammed Tawfiqul [1]
Karunasekera, Shanika [1]
Buyya, Rajkumar [1]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Cloud Comp & Distributed Syst CLOUDS, Melbourne, Vic 3010, Australia
Funding
Australian Research Council
Keywords
Sparks; Cloud computing; Costs; Task analysis; Service level agreements; Big Data; Reinforcement learning; cost-efficiency; performance improvement; deep reinforcement learning;
DOI
10.1109/TPDS.2021.3124670
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. Cloud platforms offer affordable compute resources that are easier to manage. Hence, many organizations are shifting towards a cloud deployment of their big data computing clusters. However, job scheduling is a complex problem in the presence of various Service Level Agreement (SLA) objectives such as monetary cost reduction and job performance improvement. Most of the existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level to place the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can also learn the inherent characteristics of different types of jobs to find a proper placement that reduces both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost by up to 30%.
Pages: 1695-1710
Page count: 16
Related papers
40 in total
  • [1] [Anonymous], 2012, Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing (NSDI'12)
  • [2] Bao YX, 2019, IEEE INFOCOM SER, P505, DOI 10.1109/INFOCOM.2019.8737460
  • [3] Burer S., 2012, Surveys in Operations Research and Management Science, V17, P97, DOI 10.1016/j.sorms.2012.08.001
  • [4] Scheduling Semiconductor Testing Facility by Using Cuckoo Search Algorithm With Reinforcement Learning and Surrogate Modeling
    Cao, ZhengCai
    Lin, ChengRan
    Zhou, MengChu
    Huang, Ran
    [J]. IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2019, 16 (02) : 825 - 837
  • [5] A Novel Task Provisioning Approach Fusing Reinforcement Learning for Big Data
    Cheng, Yongyi
    Xu, Gaochao
    [J]. IEEE ACCESS, 2019, 7 : 143699 - 143709
  • [6] A cost-benefit analysis of using cloud computing to extend the capacity of clusters
    de Assuncao, Marcos Dias
    di Costanzo, Alexandre
    Buyya, Rajkumar
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2010, 13 (03) : 335 - 347
  • [7] Quasar: Resource-Efficient and QoS-Aware Cluster Management
    Delimitrou, Christina
    Kozyrakis, Christos
    [J]. ACM SIGPLAN NOTICES, 2014, 49 (04) : 127 - 143
  • [8] Justice: A Deadline-aware, Fair-share Resource Allocator for Implementing Multi-analytics
    Dimopoulos, Stratos
    Krintz, Chandra
    Wolski, Rich
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2017 : 233 - 244
  • [9] George L., 2011, HBase: The Definitive Guide: Random Access to Your Planet-Size Data
  • [10] Ghodsi Ali, 2011, 8 USENIX S NETW SYST, P323