Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition

被引:22
作者
Hakkarinen, Doug [1 ]
Wu, Panruo [2 ]
Chen, Zizhong [2 ]
机构
[1] Colorado Sch Mines, Dept Elect Engn & Comp Sci, Golden, CO 80401 USA
[2] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
基金
美国国家科学基金会;
关键词
Algorithm based fault tolerance (ABFT); cholesky decomposition; extreme-scale systems; fail-stop failures;
D O I
10.1109/TPDS.2014.2320502
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Cholesky decomposition is a widely used algorithm to solve linear equations with symmetric and positive definite coefficient matrix. With large matrices, this often will be performed on high performance supercomputers with a large number of processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data added in an additional set of processes. This differs from previous works with algorithmic methods as it addresses fail-stop failures rather than fail-continue cases. The proposed fault tolerance scheme is incorporated into ScaLAPACK and validated on the supercomputer Kraken. Experimental results demonstrate that this method has decreasing overhead in relation to overall runtime as the matrix size increases, and thus shows promise to reduce the expected runtime for Cholesky factorizations on very large matrices.
引用
收藏
页码:1323 / 1335
页数:13
相关论文
共 32 条
[1]   Performance evaluation of checksum-based ABFT [J].
Al-Yamani, AA ;
Oh, N ;
McCluskey, EJ .
2001 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, PROCEEDINGS, 2001, :461-466
[2]   A LINEAR ALGEBRAIC MODEL OF ALGORITHM-BASED FAULT TOLERANCE [J].
ANFINSON, CJ ;
LUK, FT .
IEEE TRANSACTIONS ON COMPUTERS, 1988, 37 (12) :1599-1604
[3]  
[Anonymous], 2002, Accuracy and stability of numerical algorithms
[4]  
[Anonymous], 2011, P INT C SUP
[5]   BOUNDS ON ALGORITHM-BASED FAULT TOLERANCE IN MULTIPLE PROCESSOR SYSTEMS. [J].
Banerjee, Prithviraj ;
Abraham, Jacob A. .
IEEE Transactions on Computers, 1986, C-35 (04) :296-306
[6]   ALGORITHM-BASED FAULT TOLERANCE ON A HYPERCUBE MULTIPROCESSOR [J].
BANERJEE, P ;
RAHMEH, JT ;
STUNKEL, C ;
NAIR, VS ;
ROY, K ;
BALASUBRAMANIAN, V ;
ABRAHAM, JA .
IEEE TRANSACTIONS ON COMPUTERS, 1990, 39 (09) :1132-1145
[7]  
CHEN Z., 2009, P C HIGH PERF COMP N, P29
[8]   Algorithm-Based Fault Tolerance for Fail-Stop Failures [J].
Chen, Zizhong ;
Dongarra, Jack .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, 19 (12) :1628-1641
[9]   Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods [J].
Chen, Zizhong .
ACM SIGPLAN NOTICES, 2013, 48 (08) :167-176
[10]  
Chen ZZ, 2011, HPDC 11: PROCEEDINGS OF THE 20TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, P73