Model-based performance evaluation of distributed checkpointing protocols

Cited by: 7
Authors
Agbaria, Adnan [1]
Friedman, Roy [2]
Affiliations
[1] IBM Corp, Haifa Res Lab, IL-31905 Haifa, Israel
[2] Technion Israel Inst Technol, Dept Comp Sci, IL-32000 Haifa, Israel
Keywords
distributed checkpoint/restart; rollback propagation; performance analysis; Markov models
DOI
10.1016/j.peva.2007.09.001
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
A large number of distributed checkpointing protocols have appeared in the literature. However, to make informed decisions about which protocol performs best for a given environment, one must use an objective measure for comparing them. Obviously, a distributed checkpointing protocol could be the best in a specific environment, but not in another environment. This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. Using the objective measure as an evaluation technique, the paper also analyses several known protocols and compares their overhead ratios. (C) 2007 Elsevier B.V. All rights reserved.
Pages: 345-365
Number of pages: 21