Fault Tolerance and Recovery in Grid Workflow Management Systems

Cited by: 10
Authors
Sindrilaru, Elvin [1]
Costan, Alexandru [2]
Cristea, Valentin [2]
Affiliations
[1] Imperial Coll London, London, England
[2] Univ Polytehn Bucharest, Bucharest, Romania
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT AND SOFTWARE INTENSIVE SYSTEMS (CISIS 2010) | 2010
Keywords
fault tolerance; workflow management systems; dependable systems; BPEL
DOI
10.1109/CISIS.2010.113
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Complex scientific workflows are now commonly executed on global grids. With the increasing scale, complexity, heterogeneity, and dynamism of grid environments, the challenges of managing and scheduling these workflows are compounded by dependability issues arising from the inherently unreliable nature of large-scale grid infrastructure. In addition to traditional fault tolerance techniques, current grid workflow management systems need specific checkpoint-recovery schemes to address these reliability challenges. Our research aims to design and develop mechanisms for building an autonomic workflow management system that can automatically detect, diagnose, report, react to, and recover from failures of workflow execution. In this paper we present the development of a Fault Tolerance and Recovery component that extends the ActiveBPEL workflow engine. The detection mechanism inspects the messages exchanged between the workflow and the orchestrated Web Services in search of faults. Recovery of a process from a faulted state is achieved by modifying the default behavior of ActiveBPEL, and in effect constitutes a non-intrusive checkpointing mechanism. We present the results of several scenarios that demonstrate the functionality of the Fault Tolerance and Recovery component and indicate a performance increase of about 50% compared with the traditional method of resubmitting the workflow.
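The difference between checkpoint-based recovery and full resubmission described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's ActiveBPEL component: the step names, the `<Fault>` marker used for detection, and the helpers `is_fault`, `run_once`, `run`, and `make_steps` are all hypothetical, chosen only to show the idea of inspecting replies for faults and resuming from completed steps.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


class ServiceFault(Exception):
    """Raised when a (simulated) service reply carries a fault."""


def is_fault(reply: str) -> bool:
    # Assumption: a reply counts as faulty if it contains a <Fault> element.
    return "<Fault>" in reply


def run_once(steps: List[Step], checkpoint: Dict[str, str]) -> None:
    """Run the steps in order, skipping those already in the checkpoint."""
    for name, invoke in steps:
        if name in checkpoint:
            continue  # completed in an earlier attempt; reuse its result
        reply = invoke()
        if is_fault(reply):
            raise ServiceFault(name)
        checkpoint[name] = reply  # record the completed step


def run(steps: List[Step], use_checkpoint: bool, max_attempts: int = 3) -> Dict[str, str]:
    """Retry the workflow, either resuming from the checkpoint or restarting."""
    checkpoint: Dict[str, str] = {}
    for _ in range(max_attempts):
        try:
            run_once(steps, checkpoint)
            return checkpoint
        except ServiceFault:
            if not use_checkpoint:
                checkpoint.clear()  # resubmission: discard all progress
    raise RuntimeError("workflow did not complete")


def make_steps(calls: List[str]) -> List[Step]:
    """Four hypothetical steps; step C faults on its first invocation only."""
    state = {"failed_once": False}

    def ok(name: str) -> Callable[[], str]:
        def invoke() -> str:
            calls.append(name)
            return f"<Result>{name}</Result>"
        return invoke

    def flaky() -> str:
        calls.append("C")
        if not state["failed_once"]:
            state["failed_once"] = True
            return "<Fault>service unavailable</Fault>"
        return "<Result>C</Result>"

    return [("A", ok("A")), ("B", ok("B")), ("C", flaky), ("D", ok("D"))]


resumed_calls: List[str] = []
run(make_steps(resumed_calls), use_checkpoint=True)

resubmitted_calls: List[str] = []
run(make_steps(resubmitted_calls), use_checkpoint=False)
```

In this toy scenario the checkpointed run repeats only the failed step, while the resubmitted run repeats every step before it as well. The call counts illustrate the mechanism only and say nothing about the performance figures reported in the paper.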
Pages: 475-480
Number of pages: 6