Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Cited by: 10
Authors
Benacchio, Tommaso [1]
Bonaventura, Luca [1]
Altenbernd, Mirco [2, 3]
Cantwell, Chris D. [4]
Duben, Peter D. [5, 6]
Gillard, Mike [7]
Giraud, Luc [8]
Goeddeke, Dominik [2, 3]
Raffin, Erwan [9]
Teranishi, Keita [10]
Wedi, Nils [5]
Affiliations
[1] Politecn Milan, Dipartimento Matemat, MOX Modelling & Sci Comp, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy
[2] Univ Stuttgart, Inst Appl Anal & Numer Simulat, Stuttgart, Germany
[3] Univ Stuttgart, Cluster Excellence Data Driven Simulat Sci, Stuttgart, Germany
[4] Imperial Coll London, Dept Aeronaut, London, England
[5] European Ctr Medium Range Weather Forecasts, Reading, Berks, England
[6] Univ Oxford, Dept Phys, AOPP, Oxford, England
[7] Loughborough Univ, Sch Mech Elect & Mfg Engn, Loughborough, Leics, England
[8] Inria Bordeaux, HiePACS, Talence, France
[9] Atos, CEPP Ctr Excellence Performance Programming, Rennes, France
[10] Sandia Natl Labs, Livermore, CA USA
Funding
European Union Horizon 2020
Keywords
Fault-tolerant computing; high-performance computing; application-level resilience; numerical weather prediction; iterative solvers; SCIENTIFIC APPLICATIONS; FAILURE MASKING; DYNAMICAL CORE; RECOVERY; SYSTEMS; MPI; PRECONDITIONER; SCALABILITY; ALGORITHMS; CHALLENGES;
DOI
10.1177/1094342021990433
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Pages: 285-311
Number of pages: 27