Evaluating Distributed Checkpointing Protocols

Cited by: 0
Authors
Agbaria, A [1]
Freund, A [1]
Friedman, R [1]
Affiliations
[1] Univ Illinois, Coordinated Sci Lab, Urbana, IL 61801 USA
Source
23RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS | 2002
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
This paper presents an objective measure, called the overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which affects the length of the recovery process and therefore the expected program run-time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratios.
Pages: 266-273
Page count: 8