A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 1
Authors
Fu, Xiang [1]
Tang, Hao [1]
Liao, Huimin [1]
Huang, Xin [1]
Xu, Wubiao [1]
Meng, Shiman [1]
Zhang, Weiping [1]
Guo, Luanzheng [2]
Sato, Kento [3]
Affiliations
[1] Nanchang Hangkong Univ, Nanchang, Jiangxi, Peoples R China
[2] Pacific Northwest Natl Lab, Richland, WA USA
[3] RIKEN, RCCS, Kobe, Hyogo, Japan
DOI
10.1109/IPDPSW59300.2023.00061
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Tensor algebra is a powerful tool for high-order data analytics in scientific applications such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Because they often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, and similar sources, their execution must be reliable and resilient, and their results must be highly trustworthy. However, traditional fault tolerance methods such as error-correcting codes cannot protect computations. Checkpointing and redundancy techniques such as triple modular redundancy (TMR) suffer from high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches focus only on 2D linear algebra computations and are inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and then protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of ABFT algorithms that address three fundamental tensor algebra operations. These algorithms exploit the algorithmic semantics of the tensor algebra computations to achieve better performance.
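For readers unfamiliar with the 2D ABFT methods the abstract refers to, the following is a minimal sketch of the classic checksum scheme for matrix multiplication. It is not the high-dimensional scheme proposed in the paper, whose details the abstract does not give. The function names (`with_column_checksum`, `locate_error`, and so on), the tolerance `tol`, and the small test matrices are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, or None.

    c_full is the (m+1) x (n+1) full-checksum product: its last column and
    last row should equal the row sums and column sums of the data part.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(c_full, pos):
    """Repair one data element in place using its row checksum."""
    i, j = pos
    n = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(n) if k != j)
    c_full[i][j] = c_full[i][n] - others


# Hypothetical 2 x 2 inputs, chosen only for illustration.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the checksum-extended inputs yields a product that carries
# its own row and column checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))

c_full[0][1] += 100.0          # simulate a soft error in one output element
pos = locate_error(c_full)     # a mismatching row and column pinpoint it
correct_error(c_full, pos)     # rebuild the element from the row checksum
```

In this sketch the checksums add one extra row and one extra column to the product, which is why the approach is cheap for a single matrix multiplication. Applying it to a tensor operation by first unfolding the tensor into many 2D products repeats that cost per product, which is the overhead the abstract argues against.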
Pages: 326-330
Page count: 5