Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods

Cited by: 9
Authors
Pachajoa, Carlos [1]
Levonyak, Markus [1]
Gansterer, Wilfried N. [1]
Affiliations
[1] Univ Vienna, Fac Comp Sci, Vienna, Austria
Source
PROCEEDINGS OF FTXS 2018: IEEE/ACM 8TH WORKSHOP ON FAULT TOLERANCE FOR HPC AT EXTREME SCALE (FTXS) | 2018
Keywords
preconditioned conjugate gradient method; split preconditioner conjugate gradient method; extreme-scale parallel computing; node failures; resilience; algorithmic fault tolerance; ITERATIVE METHODS; RECOVERY;
DOI
10.1109/FTXS.2018.00009
Chinese Library Classification number
TP301 [Theory and methods];
Subject classification code
081202;
Abstract
We compare and refine exact and heuristic fault-tolerance extensions for the preconditioned conjugate gradient (PCG) and the split preconditioner conjugate gradient (SPCG) methods for recovering from failures of compute nodes of large-scale parallel computers. In the exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011), the solver keeps extra information from previous search directions of the (S)PCG solver, so that its state can be fully reconstructed if a node fails unexpectedly. ESR does not make use of checkpointing or external storage for saving dynamic solver data and has only negligible computation and communication overhead compared to the failure-free situation. In exact arithmetic, the reconstruction is exact, but in finite-precision computations, the number of iterations until convergence can differ slightly from the failure-free case due to rounding effects. We perform experiments to investigate the behavior of ESR in floating-point arithmetic and compare it to the heuristic linear interpolation (LI) approach by Langou et al. (2007) and Agullo et al. (2016), which does not have to keep extra information and thus has lower memory requirements. Our experiments illustrate that ESR, on average, has essentially zero overhead in terms of additional iterations until convergence, whereas the LI approach incurs much larger overheads.
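As a rough illustration of the setting the abstract describes, the following Python sketch shows a Jacobi-preconditioned CG solver together with a linear-interpolation style recovery of lost solution entries. It is not the authors' implementation. The function names (`pcg`, `li_recover`), the Jacobi preconditioner, the dense list-of-lists matrix, and the reuse of `pcg` for the local sub-solve are all assumptions made for illustration. The recovery step assumes the commonly described LI idea: the lost block x_I is recomputed from the surviving entries x_J by solving A_II x_I = b_I - A_IJ x_J. ESR itself is not shown here.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, x0, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned CG for a symmetric positive definite A.

    Returns (x, iterations). The preconditioner is diag(A), an
    illustrative choice that is not taken from the paper.
    """
    x = list(x0)
    r = [bi - ci for bi, ci in zip(b, matvec(A, x))]
    z = [ri / A[i][i] for i, ri in enumerate(r)]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        # Stop once the residual is small relative to the right-hand side.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [ri / A[i][i] for i, ri in enumerate(r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def li_recover(A, b, x, lost):
    """Hypothetical LI-style recovery of the entries of x listed in `lost`.

    Solves A_II x_I = b_I - A_IJ x_J, where I are the lost indices and
    J the surviving ones. A_II is a principal submatrix of an SPD
    matrix, so it is SPD as well and pcg can be reused for the solve.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_II = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_I, _ = pcg(A_II, rhs, [0.0] * len(lost))
    x_new = list(x)
    for i, xi in zip(lost, x_I):
        x_new[i] = xi
    return x_new
```

In this sketch, a simulated node failure would overwrite some entries of the current iterate, after which `li_recover` fills them in and `pcg` is restarted from the repaired vector. Restarting discards the previous search directions, which is one plausible reason a heuristic recovery of this kind can need extra iterations compared with an exact state reconstruction.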
Pages: 49-58
Number of pages: 10
Related papers
25 in total
[1]   Numerical recovery strategies for parallel resilient Krylov linear solvers [J].
Agullo, Emmanuel ;
Giraud, Luc ;
Guermouche, Abdou ;
Roman, Jean ;
Zounon, Mawussi .
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS, 2016, 23 (05) :888-905
[2]  
Agullo Emmanuel., 2013, Towards resilient parallel linear Krylov solvers: recover-restart strategies
[3]  
[Anonymous], 2015, International Journal of Networking and Computing
[4]  
Balay S, 1997, MODERN SOFTWARE TOOLS FOR SCIENTIFIC COMPUTING, P163
[5]  
Balay S., 2018, TECH REP
[6]   Post-failure recovery of MPI communication capability: Design and rationale [J].
Bland, Wesley ;
Bouteiller, Aurelien ;
Herault, Thomas ;
Bosilca, George ;
Dongarra, Jack .
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2013, 27 (03) :244-254
[7]   Assessing the impact of ABFT & Checkpoint composite strategies [J].
Bosilca, George ;
Bouteiller, Aurelien ;
Herault, Thomas ;
Robert, Yves ;
Dongarra, Jack .
PROCEEDINGS OF 2014 IEEE INTERNATIONAL PARALLEL & DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2014, :680-689
[8]  
Bronevetsky G, 2008, ICS'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, P155
[9]  
Chen ZZ, 2011, HPDC 11: PROCEEDINGS OF THE 20TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, P73
[10]   S-STEP ITERATIVE METHODS FOR SYMMETRIC LINEAR-SYSTEMS [J].
CHRONOPOULOS, AT ;
GEAR, CW .
JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, 1989, 25 (02) :153-168