A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 1
Authors
Fu, Xiang [1]
Tang, Hao [1]
Liao, Huimin [1]
Huang, Xin [1]
Xu, Wubiao [1]
Meng, Shiman [1]
Zhang, Weiping [1]
Guo, Luanzheng [2]
Sato, Kento [3]
Affiliations
[1] Nanchang Hangkong Univ, Nanchang, Jiangxi, Peoples R China
[2] Pacific Northwest Natl Lab, Richland, WA USA
[3] RIKEN, RCCS, Kobe, Hyogo, Japan
Source
2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW | 2023
DOI
10.1109/IPDPSW59300.2023.00061
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Tensor algebra is a powerful tool for carrying out high-order data analytics in scientific applications, such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Since these applications often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, etc., we must ensure that their execution is reliable and resilient and that the execution outcome is highly trustworthy. However, traditional fault tolerance methods like error-correcting codes cannot protect computations. Checkpointing and redundancy techniques like triple modular redundancy (TMR) suffer from high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches target only 2D linear algebra computations and are inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of ABFT algorithms that address three fundamental tensor algebra operations. We exploit the algorithmic semantics of these tensor algebra computations to achieve better performance.
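For readers unfamiliar with the 2D ABFT baseline the abstract refers to, the following is a minimal sketch of the classic row/column checksum scheme for matrix multiplication. It is not the paper's tensor algorithm; the function names, the tolerance `tol`, and the injected error are all illustrative assumptions. A checksum row is appended to the left operand and a checksum column to the right operand, so the product carries checksums that locate and repair a single corrupted element.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def find_errors(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows/columns that disagree with their checksum."""
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    return bad_rows, bad_cols

def correct_single_error(c_full):
    """Repair one corrupted data element in place using its row checksum."""
    rows, cols = find_errors(c_full)
    if len(rows) == 1 and len(cols) == 1:
        i, j = rows[0], cols[0]
        n = len(c_full[0]) - 1
        others = sum(c_full[i][k] for k in range(n) if k != j)
        c_full[i][j] = c_full[i][n] - others
    return rows, cols

# Hypothetical 2x2 operands, chosen only for illustration.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[0][1] += 100  # inject a simulated soft error into one data element
detected = correct_single_error(c_full)
```

Applying this scheme to a tensor operation would require first unfolding the tensor into matrices, which is the decomposition step the abstract identifies as a source of overhead.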
Pages: 326-330
Page count: 5