Numerical recovery strategies for parallel resilient Krylov linear solvers

被引:21
作者
Agullo, Emmanuel [1 ]
Giraud, Luc [1 ]
Guermouche, Abdou [2 ]
Roman, Jean [1 ]
Zounon, Mawussi [1 ]
机构
[1] Inria, Bordeaux, France
[2] Univ Bordeaux, Bordeaux, France
关键词
fault tolerance; hard faults; numerical linear system solution; numerical remedies; numerical resiliency; FAULT-TOLERANCE; ERROR-DETECTION; ALGORITHM; FACTORIZATION; GMRES; QR;
D O I
10.1002/nla.2059
中图分类号
O29 [应用数学];
学科分类号
070104 ;
摘要
As the computational power of high-performance computing systems continues to increase by using a huge number of cores or specialized processing units, high-performance computing applications are increasingly prone to faults. In this paper, we present a new class of numerical fault tolerance algorithms to cope with node crashes in parallel distributed environments. This new resilient scheme is designed at application level and does not require extra resources, that is, computational unit or computing time, when no fault occurs. In the framework of iterative methods for the solution of sparse linear systems, we present numerical algorithms to extract relevant information from available data after a fault, assuming a separate mechanism ensures the fault detection. After data extraction, a well-chosen part of missing data is regenerated through interpolation strategies to constitute meaningful inputs to restart the iterative scheme. We have developed these methods, referred to as interpolation-restart techniques, for Krylov subspace linear solvers. After a fault, lost entries of the current iterate computed by the solver are interpolated to define a new initial guess to restart the Krylov method. A well-suited initial guess is computed by using the entries of the faulty iterate available on surviving nodes. We present two interpolation policies that preserve key numerical properties of well-known linear solvers, namely, the monotonic decrease of the A-norm of the error of the conjugate gradient or the residual norm decrease of generalized minimal residual algorithm for solving. The qualitative numerical behavior of the resulting scheme has been validated with sequential simulations, when the number of faults and the amount of data losses are varied. Finally, the computational costs associated with the recovery mechanism have been evaluated through parallel experiments. Copyright (c) 2016 John Wiley & Sons, Ltd.
引用
收藏
页码:888 / 905
页数:18
相关论文
共 43 条
[1]  
Agullo E., 2013, RR8324 INRIA
[2]  
Agullo E, 2015, 8625 INRIA
[3]   Message logging: Pessimistic, optimistic, causal, and optimal [J].
Alvisi, L ;
Marzullo, K .
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1998, 24 (02) :149-159
[4]   Hybrid scheduling for the parallel solution of linear systems [J].
Amestoy, PR ;
Guermouche, A ;
L'Excellent, JY ;
Pralet, S .
PARALLEL COMPUTING, 2006, 32 (02) :136-156
[5]   A LINEAR ALGEBRAIC MODEL OF ALGORITHM-BASED FAULT TOLERANCE [J].
ANFINSON, CJ ;
LUK, FT .
IEEE TRANSACTIONS ON COMPUTERS, 1988, 37 (12) :1599-1604
[6]  
[Anonymous], P HIGH AV PERF COMP
[7]   DIVA: A reliable substrate for deep submicron microarchitecture design [J].
Austin, TM .
32ND ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, (MICRO-32), PROCEEDINGS, 1999, :196-207
[8]  
Borg A., 1983, Operating Systems Review, V17, P90, DOI 10.1145/773379.806617
[9]   FINE-GRAINED MULTITHREADING FOR THE MULTIFRONTAL QR FACTORIZATION OF SPARSE MATRICES [J].
Buttari, Alfredo .
SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2013, 35 (04) :C323-C345
[10]   PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS [J].
Cappello, Franck ;
Casanova, Henri ;
Robert, Yves .
PARALLEL PROCESSING LETTERS, 2011, 21 (02) :111-132