Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs

Cited by: 18
Authors
Chen, Jieyang [1]
Liang, Xin [1]
Chen, Zizhong [1]
Affiliations
[1] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
Source
2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016) | 2016
Keywords
DOI
10.1109/IPDPS.2016.81
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Extensive research has been done on developing and optimizing algorithm-based fault tolerance (ABFT) schemes for systolic arrays and general-purpose microprocessors. However, little has been done on developing and optimizing ABFT schemes for heterogeneous systems with GPU accelerators. While existing ABFT schemes can correct computing errors like 1+1=3, we find that they cannot correct many memory storage errors. In this paper, we first develop a new ABFT scheme for Cholesky decomposition that can correct both computing errors and storage errors at the same time, and then develop several optimization techniques to reduce the fault tolerance overhead of ABFT for heterogeneous systems with GPU accelerators. Experimental results demonstrate that our fault-tolerant Cholesky decomposition is able to correct both computing errors and storage errors in the middle of the computation and can achieve better performance than the state-of-the-art vendor-provided Cholesky decomposition library routine in CULA R18.
Pages: 993-1002
Page count: 10
Related papers
27 in total
[1] Banerjee P., 1990, COMPUTERS IEEE T, V39
[2] Bautista-Gomez L., 2011, P 2011 INT C HIGH PE, P1
[3] Berrocal E., 2015, P 24 INT S HIGH PERF
[4] Bouteiller A., 2015, ACM Transactions on Parallel Computing, V1, P10, DOI 10.1145/2686892
[5] Chen L., EXTENDING CHECKSUM B
[6] Chen Z., 2013, ACM SIGPLAN NOTICES
[7] Algorithm-Based Fault Tolerance for Fail-Stop Failures [J]. Chen, Zizhong; Dongarra, Jack. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, 19 (12): 1628-1641
[8] Davies T., 2013, P 22 INT S HIGH PERF
[9] Ding C., 2011, PAR DISTR PROC APPL
[10] Du P., 2011, CLUST COMP CLUSTER 2