Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3,6]
Vivien, Frederic [4,7]
Zaidouni, Dounia [5,7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High-performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. One such mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of the resources it wastes, replication can lead to higher parallel efficiency than checkpoint-recovery alone at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures, we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates, as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation, we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
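The abstract does not state the checkpointing period the authors derive, so the following is only an illustrative sketch of the quantities involved, not the paper's method. It assumes independent exponential failures, uses Young's classical first-order approximation for the period (a swapped-in textbook formula), and computes the probability that at least one of several independent groups completes a chunk without failure. All function and parameter names are hypothetical, and the model ignores downtime, recovery cost, and the longer runtime of each group when processors are split among groups.

```python
import math


def platform_mtbf(proc_mtbf: float, n_procs: int) -> float:
    """MTBF of a set of n_procs processors, assuming i.i.d. exponential failures."""
    return proc_mtbf / n_procs


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpointing period.

    This is a classical textbook formula, NOT the period derived in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def chunk_success_prob(chunk: float, group_mtbf: float, n_groups: int = 1) -> float:
    """Probability that at least one of n_groups independent groups runs for
    `chunk` time units (work plus checkpoint) without any failure.

    Assumption: every group has the same MTBF and groups fail independently.
    """
    single = math.exp(-chunk / group_mtbf)  # one group survives the chunk
    return 1.0 - (1.0 - single) ** n_groups


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    mu = platform_mtbf(proc_mtbf=1.0e9, n_procs=1_000_000)  # seconds
    period = young_period(checkpoint_cost=600.0, mtbf=mu)
    for g in (1, 2, 3):
        print(g, chunk_success_prob(period + 600.0, mu, n_groups=g))
```

Under these simplified assumptions, adding groups raises the chance that a chunk is completed by some group, which is the intuition behind checkpointing several application instances concurrently; whether this outweighs the resources spent on replication is the question the paper studies.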
Pages: 210-224
Number of pages: 15