Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Cited by: 10
Authors
Benacchio, Tommaso [1]
Bonaventura, Luca [1]
Altenbernd, Mirco [2, 3]
Cantwell, Chris D. [4]
Duben, Peter D. [5, 6]
Gillard, Mike [7]
Giraud, Luc [8]
Goeddeke, Dominik [2, 3]
Raffin, Erwan [9]
Teranishi, Keita [10]
Wedi, Nils [5]
Affiliations
[1] Politecn Milan, Dipartimento Matemat, MOX Modelling & Sci Comp, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy
[2] Univ Stuttgart, Inst Appl Anal & Numer Simulat, Stuttgart, Germany
[3] Univ Stuttgart, Cluster Excellence Data Driven Simulat Sci, Stuttgart, Germany
[4] Imperial Coll London, Dept Aeronaut, London, England
[5] European Ctr Medium Range Weather Forecasts, Reading, Berks, England
[6] Univ Oxford, Dept Phys, AOPP, Oxford, England
[7] Loughborough Univ, Sch Mech Elect & Mfg Engn, Loughborough, Leics, England
[8] Inria Bordeaux, HiePACS, Talence, France
[9] Atos, CEPP Ctr Excellence Performance Programming, Rennes, France
[10] Sandia Natl Labs, Livermore, CA USA
Funding
European Union Horizon 2020;
Keywords
Fault-tolerant computing; high-performance computing; application-level resilience; numerical weather prediction; iterative solvers; SCIENTIFIC APPLICATIONS; FAILURE MASKING; DYNAMICAL CORE; RECOVERY; SYSTEMS; MPI; PRECONDITIONER; SCALABILITY; ALGORITHMS; CHALLENGES;
DOI
10.1177/1094342021990433
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Pages: 285-311
Number of pages: 27