Complex scientific applications made fault-tolerant with the sparse grid combination technique

Cited by: 11
Authors
Ali, Md Mohsin [1]
Strazdins, Peter E. [1]
Harding, Brendan [2]
Hegland, Markus [2]
Affiliations
[1] Australian Natl Univ, Res Sch Comp Sci, Canberra, ACT 2601, Australia
[2] Australian Natl Univ, Math Sci Inst, Canberra, ACT, Australia
Funding
Australian Research Council
Keywords
Fault tolerance; ULFM; process failure recovery; PDE solver; sparse grid combination; approximation error; gyrokinetic plasma; Taxila Lattice Boltzmann Method; Solid Fuel Ignition; PARALLEL;
DOI
10.1177/1094342015628056
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Ultra-large-scale simulations that solve partial differential equations (PDEs) require very large computational systems to complete in a timely manner. Studies have shown that the rate of failure grows with system size, and these trends are likely to worsen in future machines. Thus, as systems and the problems solved on them continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), a cost-effective method for solving higher-dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance. In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, the Taxila Lattice Boltzmann Method application, and the Solid Fuel Ignition application. We use an alternate component grid combination formula, adding some redundancy to the SGCT, to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust across multiple failures and recoveries of both processes and nodes. An acceptable degree of modification to the applications is required. Results using the 2-D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%) compared to the same simulation on a single full-resolution grid. The benefits improve when the 3-D SGCT is used. Experiments show the applications' ability to recover successfully from multiple failures, and applying the SGCT multiple times reduces the error of the computed solution. Process recovery via ULFM MPI increases from approximately 1.5 sec at 64 cores to approximately 5 sec at 2048 cores for a one-off failure. By comparison, the applications' built-in checkpointing with job restart, used in conjunction with the classical SGCT on failure, has overheads four times as large for a single failure, excluding the recomputation overhead. An analysis for a long-running application that takes recomputation times into account indicates a reduction in overhead of over an order of magnitude.
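The idea of an alternate combination formula can be illustrated with a minimal sketch. It assumes the standard inclusion-exclusion rule for combination coefficients on a downward-closed set of grid levels, which is a general result from the sparse grid literature and not necessarily the exact formula used in the article. The function names `combination_coefficients` and `classical_index_set` are hypothetical and chosen here for illustration only.

```python
from itertools import product


def combination_coefficients(index_set):
    """Coefficients c_l = sum over z in {0,1}^d of (-1)^|z| * [l + z in index_set].

    Assumes index_set is downward closed (a "downset") of level tuples.
    Grids whose coefficient is zero are omitted from the result.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for level in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            shifted = tuple(l + s for l, s in zip(level, z))
            if shifted in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[level] = c
    return coeffs


def classical_index_set(n, dim=2):
    """All level tuples with entries >= 1 and level sum <= n + dim - 1."""
    return {
        level
        for level in product(range(1, n + dim), repeat=dim)
        if sum(level) <= n + dim - 1
    }


# Classical 2-D combination at level n = 3.
full = classical_index_set(3)
classical = combination_coefficients(full)

# Illustrative failure: the component grid at level (2, 2) is lost.
# Dropping it leaves a smaller downset, and the coefficients are recomputed.
alternate = combination_coefficients(full - {(2, 2)})

print(sorted(classical.items()))
print(sorted(alternate.items()))
```

In this example the classical combination uses the grids on the two upper diagonals, with coefficient +1 for levels (1, 3), (2, 2), (3, 1) and -1 for levels (1, 2), (2, 1). After the loss of grid (2, 2), the recomputed combination uses +1 for (1, 3) and (3, 1) and -1 for (1, 1). Grid (1, 1) is not needed in the classical combination, which suggests why some redundant lower-level grids must be computed in advance for recovery to be possible.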
Pages: 335-359
Number of pages: 25