Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures

Cited by: 2
Authors
Jia, Yulu [1]
Luszczek, Piotr [1]
Dongarra, Jack [1, 2, 3]
Affiliations
[1] Univ Tennessee, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Oak Ridge, TN USA
[3] Univ Manchester, Manchester M13 9PL, Lancs, England
Source
2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2016
Keywords
LINEAR-SYSTEM SOLVER; FAULT-TOLERANCE; SOFT ERRORS; PERFORMANCE
DOI
10.1109/IPDPSW.2016.34
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
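As a rough illustration of the algorithm-based fault tolerance idea named in the abstract, the sketch below keeps row and column checksums of a small matrix and uses them to locate and repair one corrupted entry. It is not the authors' implementation: the function names `checksums` and `detect_and_correct` are hypothetical, the matrix is a toy example, and the sketch handles only a single error, whereas the paper targets simultaneous soft errors inside a hybrid Hessenberg reduction.

```python
# Illustrative ABFT sketch (an assumption for exposition, not the paper's code):
# row and column checksums locate and repair one corrupted matrix entry.

def checksums(a):
    """Return (row_sums, col_sums) for a list-of-lists matrix."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def detect_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Repair at most one corrupted entry in place; return its (i, j) or None."""
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None  # checksums agree: no error detected
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one error; this sketch handles a single one")
    i, j = bad_rows[0], bad_cols[0]
    # The row-checksum mismatch equals the magnitude of the corruption.
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j


matrix = [[4.0, 1.0, 2.0], [3.0, 5.0, 1.0], [2.0, 2.0, 6.0]]
saved_rows, saved_cols = checksums(matrix)   # checksums taken before the fault
matrix[1][2] += 8.0                          # simulate a soft error in one entry
location = detect_and_correct(matrix, saved_rows, saved_cols)
```

The intersection of the mismatching row and column pinpoints the faulty entry, and the size of the mismatch gives the correction. A real implementation would have to maintain the checksums through each transformation of the reduction rather than recompute them from scratch.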
Pages: 653-662
Page count: 10