Exploiting Asynchrony from Exact Forward Recovery for DUE in Iterative Solvers

Cited by: 21
Authors
Jaulmes, Luc [1]
Casas, Marc
Moreto, Miquel
Ayguade, Eduard
Labarta, Jesus
Valero, Mateo
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
Source
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2015
DOI
10.1145/2807591.2807599
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) that relies on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware is able to detect errors precisely. Relations extracted straightforwardly from the solver allow lost data to be recovered exactly. This method is free of the overheads of backward recovery schemes such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and their preconditioned versions. We implement our resilience techniques on CG, considering scenarios from small (8 cores) to large (1024 cores) scales, and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them onto the critical path of the application. A trade-off exists between the two approaches, depending on the error rate the solver is experiencing. Under realistic error rates, overlapping decreases overheads from 5.37% to 3.59% for a non-preconditioned CG on 8 cores.
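The abstract does not spell out which relations are used, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a CG-style solver that keeps the residual r = b - A x alongside the iterate x, and that a detected error wipes a known set of entries of x (standing in for a lost memory page). Restricting the relation to the lost rows gives a small linear system whose solution restores those entries. The names `recover_block`, `solve_dense`, and `matvec` are hypothetical helpers introduced for illustration, not the authors' implementation.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve_dense(M, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol


def recover_block(A, b, r, x, lost):
    """Rebuild the entries of x listed in `lost` from the relation r = b - A x.

    For each lost row i:
        sum_{j in lost} A[i][j] * x[j] = b[i] - r[i] - sum_{j kept} A[i][j] * x[j]
    Assumption: the block of A indexed by `lost` is nonsingular, which holds
    for the symmetric positive definite matrices CG is applied to.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    block = [[A[i][j] for j in lost] for i in lost]
    recovered = list(x)
    for i, value in zip(lost, solve_dense(block, rhs)):
        recovered[i] = value
    return recovered


# Illustrative data: a small SPD tridiagonal system and an arbitrary iterate.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -0.25, 1.0, 0.75]
r = [bi - axi for bi, axi in zip(b, matvec(A, x))]

# Simulate a detected, uncorrected error that wipes entries 1 and 2 of x.
lost = [1, 2]
damaged = list(x)
for i in lost:
    damaged[i] = float("nan")

restored = recover_block(A, b, r, damaged, lost)
```

Because the lost entries are recomputed from data the solver already holds, the solver can continue from the same iterate rather than rolling back to a checkpoint or restarting. The sketch only works because the damaged indices are known, which is the role the abstract assigns to page-level error detection.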
Pages: 12