A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 2

Authors:

Fu, Xiang ^{[1]}

Tang, Hao ^{[1]}

Liao, Huimin ^{[1]}

Huang, Xin ^{[1]}

Xu, Wubiao ^{[1]}

Meng, Shiman ^{[1]}

Zhang, Weiping ^{[1]}

Guo, Luanzheng ^{[2]}

Sato, Kento ^{[3]}

Affiliations:

[1] Nanchang Hangkong Univ, Nanchang, Jiangxi, Peoples R China

[2] Pacific Northwest Natl Lab, Richland, WA USA

[3] RIKEN, RCCS, Kobe, Hyogo, Japan

Source:

2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW | 2023

Keywords:

DOI:

10.1109/IPDPSW59300.2023.00061

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Tensor algebra is a powerful tool for high-order data analytics in scientific applications such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Because they often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, etc., their execution must be reliable and resilient, and the execution outcome must be highly trustworthy. However, traditional fault tolerance methods like error-correcting codes cannot protect computations. Checkpointing and redundancy techniques like triple modular redundancy (TMR) suffer from high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches focus only on 2D linear algebra computations and are inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of different ABFT algorithms for addressing three fundamental tensor algebra operations. We make the best use of the algorithmic semantics of these tensor algebra computations to achieve better performance.
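To make the idea of checksum-based ABFT on a tensor operation concrete, the sketch below protects a tensor-times-vector product on a third-order tensor with a single checksum slice. It is an illustration only and not the authors' algorithm: the abstract does not name the three operations or describe their encodings, so the choice of operation, the function names (`ttv_mode3`, `checksum_slice`, `verify_ttv`), and the tolerance `tol` are all assumptions. The check relies on linearity: summing the output over the first mode must equal the product of the summed input slice with the same vector.

```python
def ttv_mode3(X, v):
    """Tensor-times-vector along the last mode: Y[i][j] = sum_k X[i][j][k] * v[k]."""
    return [[sum(x * w for x, w in zip(fiber, v)) for fiber in row] for row in X]


def checksum_slice(X):
    """Encode X by summing over the first mode: C[j][k] = sum_i X[i][j][k]."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    return [[sum(X[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]


def verify_ttv(Y, C, v, tol=1e-9):
    """Return the column indices j where the output checksum disagrees.

    expected[j] = sum_k C[j][k] * v[k]   (computed from the encoded input)
    actual[j]   = sum_i Y[i][j]          (computed from the output)
    """
    expected = [sum(c * w for c, w in zip(row, v)) for row in C]
    actual = [sum(Y[i][j] for i in range(len(Y))) for j in range(len(Y[0]))]
    return [j for j, (e, a) in enumerate(zip(expected, actual)) if abs(e - a) > tol]


# Hypothetical 2 x 2 x 2 input tensor and vector, chosen only for illustration.
X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
v = [1, 10]

C = checksum_slice(X)          # encoding step, done before the computation
Y = ttv_mode3(X, v)            # the protected computation
clean = verify_ttv(Y, C, v)    # no mismatch expected on a fault-free run

Y[1][0] += 4                   # simulate a soft error in one output element
faulty = verify_ttv(Y, C, v)   # the corrupted column is flagged

print(clean, faulty)
```

A single checksum like this only localizes the error to a column of the output. Locating the exact element, or correcting it, would need a second checksum along another mode, which is omitted here to keep the sketch short.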


Pages: 326-330

Page count: 5
