Independent checkpointing in a heterogeneous grid environment

Cited by: 4
Authors
Feller, Eugen [1]
Mehnert-Spahn, John [2]
Schoettner, Michael [2]
Morin, Christine [1]
Affiliations
[1] INRIA, Ctr Rennes Bretagne Atlantique, F-35042 Rennes, France
[2] Univ Dusseldorf, Inst Informat, D-40255 Dusseldorf, Germany
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2012 / Volume 28 / Issue 01
Keywords
Fault tolerance; Backward error recovery; Independent checkpointing; Heterogeneity; Distributed systems; Grid computing; RECOVERY;
DOI
10.1016/j.future.2011.03.012
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not bound to a specific checkpointer and thus can be used transparently on top of any checkpointer package. To evaluate the prototype, we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability. (C) 2011 Elsevier B.V. All rights reserved.
Pages: 163-170
Page count: 8