Checkpointing Workflows for Fail-Stop Errors

被引:18
作者
Han, Li [1 ,2 ,3 ]
Canon, Louis-Claude [4 ]
Casanova, Henri [5 ]
Robert, Yves [1 ,2 ,6 ]
Vivien, Frederic [1 ,2 ]
机构
[1] CNRS, LIP, Ecole Normale Super Lyon, F-75016 Paris, France
[2] INRIA, F-75016 Paris, France
[3] East China Normal Univ, Shanghai 3663, Peoples R China
[4] Univ Bourgogne Franche Comte, FEMTO ST, F-25030 Besancon, France
[5] Univ Hawaii Manoa, Honolulu, HI 96822 USA
[6] Univ Tennessee, Knoxville, TN 37996 USA
关键词
Workflow; checkpoint; fail-stop error; resilience; FAULT-TOLERANCE; COMPLEXITY; ALGORITHM; GRAPH;
D O I
10.1109/TC.2018.2801300
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs Minimal Series-Parallel Graphs (M-SPGs), which is relevant to many real-world workflow applications For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
引用
收藏
页码:1105 / 1120
页数:16
相关论文
共 55 条
[1]  
Altintas I, 2004, P INT C SCI STAT DAT, V16, P423, DOI DOI 10.1109/SSDM.2004.1311241
[2]  
[Anonymous], FAULT TOLERANCE TECH
[3]  
[Anonymous], NOTE COMPLEXITY NET
[4]  
[Anonymous], 2017, Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis
[5]  
[Anonymous], PARALLEL BRANCH BOUN
[6]  
[Anonymous], P 1 ACM SIGMOD WORKS
[7]  
[Anonymous], PARALLEL ALGORITHMS
[8]  
[Anonymous], CHECKPOINTING WORKFL
[9]  
[Anonymous], 2016, SCHEDULING THEORY AL, DOI DOI 10.1007/978-3-319-26580-3
[10]  
[Anonymous], 2014, PEG WORKFL GEN