A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 1
Authors
Fu, Xiang [1]
Tang, Hao [1]
Liao, Huimin [1]
Huang, Xin [1]
Xu, Wubiao [1]
Meng, Shiman [1]
Zhang, Weiping [1]
Guo, Luanzheng [2]
Sato, Kento [3]
Affiliations
[1] Nanchang Hangkong Univ, Nanchang, Jiangxi, Peoples R China
[2] Pacific Northwest Natl Lab, Richland, WA USA
[3] RIKEN, RCCS, Kobe, Hyogo, Japan
DOI
10.1109/IPDPSW59300.2023.00061
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Tensor algebra is a powerful tool for high-order data analytics in scientific applications such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Because they often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, and similar effects, their execution must be reliable and resilient, and their results must be trustworthy. However, traditional fault tolerance methods such as error-correcting codes cannot protect computations. Checkpointing and redundancy techniques such as triple modular redundancy (TMR) incur high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches target only 2D linear algebra computations and are inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of ABFT algorithms for three fundamental tensor algebra operations. These algorithms exploit the algorithmic semantics of the tensor algebra computations to achieve better performance.
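For readers unfamiliar with the "existing ABFT approaches" for 2D linear algebra that the abstract contrasts against, the following is a minimal sketch of the classic checksum idea applied to matrix multiplication. It is not the tensor scheme proposed in the paper, whose details the abstract does not give. The function names (`with_column_checksum`, `with_row_checksum`, `locate_error`, `correct_error`), the tolerance `tol`, and the injected fault are all illustrative assumptions, and the sketch handles only a single corrupted output element.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, or None."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    # A data row is suspect if its entries no longer add up to its checksum.
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    # Same test for each data column against the checksum row.
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(c_full, pos):
    """Rebuild the corrupted element from its row checksum, in place."""
    i, j = pos
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# The product of the checksum-extended inputs carries row and column checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = [row[:] for row in c_full]

c_full[0][1] += 100.0  # hypothetical soft error in one output element
found = locate_error(c_full)
if found is not None:
    correct_error(c_full, found)
```

The checksums cost one extra row and one extra column per multiplication. The abstract's point is that applying this kind of 2D protection to every slice of a decomposed tensor operation adds up to an overhead the authors consider unacceptable.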
Pages: 326-330
Number of pages: 5