Checkpointing algorithms and fault prediction

Cited by: 17
Authors
Aupy, Guillaume [1, 4]
Robert, Yves [1, 3, 4]
Vivien, Frederic [2, 4]
Zaidouni, Dounia [2, 4]
Affiliations
[1] Ecole Normale Super Lyon, Lyon, France
[2] INRIA, Palaiseau, France
[3] Univ Tennessee, Knoxville, TN 37996 USA
[4] Univ Lyon 1, CNRS, INRIA, LIP, Ecole Normale Super Lyon, UMR5668, F-69365 Lyon, France
Keywords
Algorithms; Checkpoint; Prediction; Fault-tolerance; Resilience; Exascale
DOI
10.1016/j.jpdc.2013.10.010
Chinese Library Classification
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide optimal algorithms to decide whether and when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale. (C) 2013 Elsevier Inc. All rights reserved.
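As background for the abstract, the sketch below illustrates the classical first-order checkpointing period attributed to Young, together with the standard definitions of predictor recall and precision. It is a minimal illustration under stated assumptions, not the paper's extended analysis: the waste model ignores downtime and recovery time, and all function and parameter names are hypothetical.

```python
import math


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """Classical first-order optimal period: T = sqrt(2 * mu * C).

    mtbf (mu) and checkpoint_cost (C) must use the same time unit.
    """
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_waste(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """First-order fraction of time lost: C / T + T / (2 * mu).

    Assumption: downtime and recovery cost are neglected.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual faults that the predictor announced."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of announced faults that actually occurred."""
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    mu, c = 86400.0, 600.0  # hypothetical values: one-day MTBF, ten-minute checkpoint
    t = young_period(mu, c)
    print(f"period = {t:.1f} s, waste = {first_order_waste(t, mu, c):.3f}")
    print(f"recall = {recall(60, 40):.2f}, precision = {precision(60, 20):.2f}")
```

Evaluating `first_order_waste` slightly below and above the value returned by `young_period` gives a larger waste on both sides, which is a quick check that the period minimizes this simplified model.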
Pages: 2048-2064
Page count: 17
Related papers
26 in total
  • [1] [Anonymous], 2010, CCGrid, DOI 10.1109/CCGRID.2010.71
  • [2] Bougeret M., 2011, P SC 11
  • [3] PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS
    Cappello, Franck
    Casanova, Henri
    Robert, Yves
    [J]. PARALLEL PROCESSING LETTERS, 2011, 21 (02) : 111 - 132
  • [4] Proactive management of software aging
    Castelli, V
    Harper, RE
    Heidelberger, P
    Hunter, SW
    Trivedi, KS
    Vaidyanathan, K
    Zeggert, WP
    [J]. IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 2001, 45 (02) : 311 - 332
  • [5] Daly J.T., 2004, FGCS, V22, P303
  • [6] Ferreira K., 2011, P 2011 ACM IEEE C SU
  • [7] Fulp E. W., 2008, P 1 USENIX C AN SYST
  • [8] Gainaru A., 2012, P IPDPS 12
  • [9] Gainaru A, 2012, INT CONF HIGH PERFOR
  • [10] Heath T., 2002, SIGMETRICS PERF EVAL, V30