Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Cited by: 10
Authors
Benacchio, Tommaso [1]
Bonaventura, Luca [1]
Altenbernd, Mirco [2, 3]
Cantwell, Chris D. [4]
Duben, Peter D. [5, 6]
Gillard, Mike [7]
Giraud, Luc [8]
Goeddeke, Dominik [2, 3]
Raffin, Erwan [9]
Teranishi, Keita [10]
Wedi, Nils [5]
Affiliations
[1] Politecn Milan, Dipartimento Matemat, MOX Modelling & Sci Comp, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy
[2] Univ Stuttgart, Inst Appl Anal & Numer Simulat, Stuttgart, Germany
[3] Univ Stuttgart, Cluster Excellence Data Driven Simulat Sci, Stuttgart, Germany
[4] Imperial Coll London, Dept Aeronaut, London, England
[5] European Ctr Medium Range Weather Forecasts, Reading, Berks, England
[6] Univ Oxford, Dept Phys, AOPP, Oxford, England
[7] Loughborough Univ, Sch Mech Elect & Mfg Engn, Loughborough, Leics, England
[8] Inria Bordeaux, HiePACS, Talence, France
[9] Atos, CEPP Ctr Excellence Performance Programming, Rennes, France
[10] Sandia Natl Labs, Livermore, CA USA
Funding
European Union Horizon 2020
Keywords
Fault-tolerant computing; high-performance computing; application-level resilience; numerical weather prediction; iterative solvers; SCIENTIFIC APPLICATIONS; FAILURE MASKING; DYNAMICAL CORE; RECOVERY; SYSTEMS; MPI; PRECONDITIONER; SCALABILITY; ALGORITHMS; CHALLENGES;
DOI
10.1177/1094342021990433
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Pages: 285-311
Page count: 27
Related papers
50 in total
  • [21] TRENDS IN HIGH-PERFORMANCE COMPUTING
    Kindratenko, Volodymyr
    Trancoso, Pedro
    COMPUTING IN SCIENCE & ENGINEERING, 2011, 13 (03) : 92 - 95
  • [22] High-Performance Computing with TeraStat
    Bompiani, Edoardo
    Petrillo, Umberto Ferraro
    Lasinio, Giovanna Jona
    Palini, Francesco
    2020 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS (DASC/PICOM/CBDCOM/CYBERSCITECH), 2020, : 499 - 506
  • [23] Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches
    Varghese, Blesson
    Mckee, Gerard
    Alexandrov, Vassil
    COMPUTERS IN BIOLOGY AND MEDICINE, 2014, 48 : 28 - 41
  • [24] A Two-Level Fault-Tolerance Technique for High Performance Computing Applications
    Aseeri, Aishah M.
    Fadel, Mai A.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (12) : 46 - 54
  • [25] The marketplace of high-performance computing
    Strohmaier, E
    Dongarra, JJ
    Meuer, HW
    Simon, HD
    PARALLEL COMPUTING, 1999, 25 (13-14) : 1517 - 1544
  • [26] Future microwave instruments for numerical weather prediction and climate research
    Klein, U
    Lin, CC
    Charlton, J
    Goutoule, JM
    Atkinson, N
    Eymard, L
    SENSORS, SYSTEMS AND NEXT-GENERATION SATELLITES VI, 2003, 4881 : 232 - 243
  • [27] Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems
    Engelmann, Christian
    Naughton, Thomas
    2013 42ND ANNUAL INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), 2013, : 960 - 969
  • [28] Massively parallel solvers for elliptic partial differential equations in numerical weather and climate prediction
    Mueller, Eike H.
    Scheichl, Robert
    QUARTERLY JOURNAL OF THE ROYAL METEOROLOGICAL SOCIETY, 2014, 140 (685) : 2608 - 2624
  • [29] Scaling modeling and simulation on high-performance computing clusters
    Mikailov, Mike
    Qiu, Junshan
    Luo, Fu-Jyh
    Whitney, Stephen
    Petrick, Nicholas
    SIMULATION-TRANSACTIONS OF THE SOCIETY FOR MODELING AND SIMULATION INTERNATIONAL, 2020, 96 (02) : 221 - 232
  • [30] Special issue editorial: Accelerators for high-performance computing
    Doallo, Ramon
    Fraguela, Basilio B.
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2012, 72 (09) : 1055 - 1056