FAULT-TOLERANT PARALLEL MULTIGRID METHOD ON UNSTRUCTURED ADAPTIVE MESH

Cited by: 0
Authors
Fung, Frederick [1, 2]
Stals, Linda [2 ]
Deng, Quanling [3 ]
Affiliations
[1] Australian Natl Univ, Math Sci Inst, Canberra, ACT 2601, Australia
[2] Australian Natl Univ, Natl Computat Infrastruct, Canberra, ACT 2601, Australia
[3] Australian Natl Univ, Sch Comp, Canberra, ACT 2601, Australia
Keywords
algorithmic-based fault tolerance; unstructured adaptive meshes; geometric multigrid; DAVIDSON METHOD; RECOVERY;
DOI
10.1137/23M1582904
Chinese Library Classification number
O29 [Applied Mathematics];
Discipline classification code
070104;
Abstract
As the generation of exascale high-performance clusters begins, it has become evident that numerical algorithms will greatly benefit from built-in resilience features that can handle system faults. Prior studies of fault-tolerant multigrid methods have focused on structured grids. In this work, however, we study the resilience of multigrid solvers on unstructured grids with adaptive refinement. The challenge lies in the fact that unstructured grids distributed across multiple processors may manifest as local hierarchical grids with unaligned boundaries. Our numerical experiments highlight that this disparity can result in divergence when employing standard local multigrid for fault recovery. We analyze this phenomenon by using an energy control condition. To tackle the divergence issue, we propose a simple variation of the multigrid V-cycle that scales the coarse problem. We present a convergence proof for the new algorithm. By implementing this new method for local recovery, our numerical experiments confirm that convergence can be recovered on unstructured grids while the algorithm agrees with the standard multigrid V-cycle on grids with aligned boundaries. More importantly, the impact of a fault can be mitigated and delays in the global multigrid iterations can be reduced. Finally, we investigate how local regions within the adaptive mesh, associated with different faulty processors, affect the effectiveness of fault recovery.
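The abstract does not say how the coarse problem is scaled, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a 1D Poisson model problem on a structured grid (not an unstructured adaptive mesh), a two-grid cycle in place of a full V-cycle, and a hypothetical `scale` parameter that multiplies the Galerkin coarse operator. With `scale=1.0` the cycle reduces to the standard two-grid method.

```python
import numpy as np

def poisson_matrix(n):
    """1D Poisson matrix with n interior points on (0, 1), Dirichlet boundaries."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation(n_coarse):
    """Linear interpolation from n_coarse to 2*n_coarse + 1 interior points."""
    n_fine = 2 * n_coarse + 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1          # fine node that coincides with coarse node j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def jacobi(A, u, f, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi smoother."""
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + omega * (f - A @ u) / d
    return u

def two_grid_cycle(A, P, f, u, scale=1.0, sweeps=2):
    """One two-grid cycle; `scale` (an assumption) multiplies the coarse operator."""
    R = 0.5 * P.T                      # full-weighting restriction
    A_c = R @ A @ P                    # Galerkin coarse operator
    u = jacobi(A, u, f, sweeps)        # pre-smoothing
    r_c = R @ (f - A @ u)              # restricted residual
    e_c = np.linalg.solve(scale * A_c, r_c)   # scaled coarse problem
    u = u + P @ e_c                    # coarse-grid correction
    return jacobi(A, u, f, sweeps)     # post-smoothing

def residual_history(n_coarse=15, cycles=10, scale=1.0):
    """Residual norms before the first cycle and after each subsequent cycle."""
    P = prolongation(n_coarse)
    A = poisson_matrix(P.shape[0])
    f = np.ones(P.shape[0])
    u = np.zeros_like(f)
    history = [np.linalg.norm(f - A @ u)]
    for _ in range(cycles):
        u = two_grid_cycle(A, P, f, u, scale=scale)
        history.append(np.linalg.norm(f - A @ u))
    return history
```

Calling `residual_history(scale=1.0)` and `residual_history(scale=1.5)` lets one compare how the residual decays with and without scaling on this aligned model grid; it does not reproduce the divergence the abstract reports for unaligned local hierarchies.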
Pages: S145-S169
Page count: 25