High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors

Cited by: 14
Authors
Du, Peng [1]
Luszczek, Piotr [1]
Dongarra, Jack [1]
Affiliations
[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, ICCS 2012 | 2012 / Vol. 9
Keywords
soft error; fault tolerance; multiple errors; dense linear system solver;
DOI
10.1016/j.procs.2012.04.023
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
In the multi-petaflop era for supercomputers, the number of computing cores is growing exponentially. However, as integrated circuit technology scales below 65 nm, the critical charge required to flip a gate or a memory cell has been reduced, which raises the rate of soft errors caused by cosmic radiation. Soft errors affect computers by producing silent data corruption, which is hard to detect and correct. Current research on soft error resilience for dense linear solvers offers limited capability on large-scale computing systems and suffers from both soft errors and round-off errors due to floating-point arithmetic. This work proposes a fault-tolerant algorithm that recovers the solution of a dense linear system Ax = b from multiple spatial and temporal soft errors. Experimental results on a Cray XT5 supercomputer confirm scalable performance of the proposed resilience functionality and negligible overhead in solution recovery.
Pages: 216-225
Page count: 10