Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3, 6]
Vivien, Frederic [4, 7]
Zaidouni, Dounia [5, 7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. One such mechanism is replication, in which multiple processors perform the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency than checkpoint-recovery alone at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures, we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates, as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation, we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
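The abstract's claim that checkpointing overhead grows with scale can be illustrated with a small sketch. It does not reproduce the bound or the period derived in the paper. Instead it uses Young's classical first-order approximation of the checkpointing period, sqrt(2 * C * MTBF), and assumes independent exponential failures, so that the platform MTBF is the per-processor MTBF divided by the number of processors. The function names and the numeric values (a 125-year processor MTBF and a 10-minute checkpoint) are hypothetical and chosen only for illustration.

```python
import math


def young_period(checkpoint_cost, platform_mtbf):
    """Young's first-order approximation of the checkpointing period (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * platform_mtbf)


def first_order_waste(checkpoint_cost, platform_mtbf):
    """Approximate fraction of time lost, valid only when the checkpoint cost
    is small compared with the platform MTBF.

    First term: time spent writing checkpoints (one per period).
    Second term: expected re-execution after a failure (half a period on average).
    """
    period = young_period(checkpoint_cost, platform_mtbf)
    return checkpoint_cost / period + period / (2.0 * platform_mtbf)


# Hypothetical values, not taken from the paper.
PROCESSOR_MTBF = 125 * 365 * 24 * 3600.0  # 125 years per processor, in seconds
CHECKPOINT_COST = 600.0                   # 10 minutes, in seconds

if __name__ == "__main__":
    for processors in (2**10, 2**14, 2**18):
        # Assumption: independent exponential failures, so rates add up.
        platform_mtbf = PROCESSOR_MTBF / processors
        period = young_period(CHECKPOINT_COST, platform_mtbf)
        waste = first_order_waste(CHECKPOINT_COST, platform_mtbf)
        print(f"{processors:>7} processors: period {period:9.0f} s, waste {waste:.3f}")
```

Under these assumptions the waste fraction simplifies to sqrt(2 * C / MTBF), so it rises as the processor count grows and the platform MTBF shrinks. This is the effect that motivates adding replication on top of checkpoint-recovery.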
Pages: 210-224
Page count: 15