Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment

Cited by: 18
Authors
Dong, Tingting [1, 2]
Xue, Fei [1 ]
Tang, Hengliang [1 ]
Xiao, Chuangbai [2 ]
Affiliations
[1] Beijing Wuzi Univ, Beijing, Peoples R China
[2] Beijing Univ Technol, Beijing, Peoples R China
Keywords
Fault-tolerant strategy; Workflow scheduling; Resubmission; Replication; Deep reinforcement learning; ENERGY; COST
DOI
10.1007/s10489-022-03963-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Cloud computing is widely used in many fields because it can provide sufficient computing resources to meet users' demands (workflows) quickly and effectively. However, resource failure is inevitable, so fault tolerance must be considered when optimizing workflow scheduling. Most previous algorithms are based on failure prediction and fault-tolerant strategies, which can cause time delays and waste resources. In this paper, an adaptive fault-tolerant workflow scheduling framework called RLFTWS is proposed, which combines these two methods through deep reinforcement learning and aims to minimize the makespan and the resource usage rate. In this framework, fault-tolerant workflow scheduling is formulated as a Markov decision process in which the resubmission and replication strategies are the two actions. A heuristic algorithm is designed for task allocation and execution according to the selected fault-tolerant strategy. A double deep Q network (DDQN) is developed to select the fault-tolerant strategy adaptively for each task under the current environment state; it not only predicts but also learns while interacting with the environment. Simulation results show that the proposed RLFTWS can efficiently balance the makespan and resource usage rate while achieving fault tolerance.
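The action selection described in the abstract can be illustrated with the standard double DQN target, in which the online network picks the next action and the target network evaluates it. The following is a minimal sketch of that general rule only, not the RLFTWS implementation: the action encoding, the reward value, the discount factor, and the Q-value lists are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from typing import Sequence

# Hypothetical action encoding: the abstract names the two strategies
# but does not say how they are indexed.
RESUBMISSION = 0
REPLICATION = 1


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def double_dqn_target(
    reward: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    gamma: float = 0.5,
    done: bool = False,
) -> float:
    """Standard double DQN target for one transition.

    q_online_next and q_target_next hold the Q-values of the next state
    for each action, as estimated by the online and target networks.
    """
    if done:
        # Terminal transition: no bootstrapped future value.
        return reward
    # The online network selects the action ...
    best_action = argmax(q_online_next)
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best_action]


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    target = double_dqn_target(
        reward=-1.0,
        q_online_next=[0.2, 0.5],   # online net prefers REPLICATION
        q_target_next=[0.4, 0.25],  # target net scores REPLICATION at 0.25
    )
    print(target)
```

Decoupling selection from evaluation in this way is what distinguishes double DQN from plain DQN, which would instead take the maximum of `q_target_next` directly.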
Pages: 9916-9932
Page count: 17