Automatic Risk-based Selective Redundancy for Fault-tolerant Task-parallel HPC Applications

Cited by: 5

Authors:

Subasi, Omer [1]

Unsal, Osman [2]

Krishnamoorthy, Sriram [1]

Affiliations:

[1] Pacific Northwest Natl Lab, Richland, WA 99352 USA

[2] Barcelona Supercomp Ctr, Barcelona, Spain

Source:

PROCEEDINGS OF ESPM2 2017: THIRD INTERNATIONAL WORKSHOP ON EXTREME SCALE PROGRAMMING MODELS AND MIDDLEWARE | 2017

Keywords:

Fault-tolerance; selective redundancy; task-parallelism; dataflow

DOI:

10.1145/3152041.3152083

Chinese Library Classification (CLC):

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and lightweight redundancy mechanism to mitigate both error types. We propose partial task-replication and checkpointing for task-parallel HPC applications to mitigate silent and fail-stop errors. To avoid the prohibitive costs of complete replication, we introduce a lightweight selective replication mechanism. Using a fully automatic and transparent heuristic, we identify and selectively replicate only the reliability-critical tasks based on a risk metric. Our approach detects and corrects around 70% of silent errors with only 5% average performance overhead. Additionally, the performance overhead of the heuristic itself is negligible.
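
The idea of replicating only the riskiest tasks can be illustrated with a minimal Python sketch. It is not the authors' heuristic or risk metric, which this record does not describe. The `Task` type, the `risk` and `cost` fields, the greedy budget rule, and the compare-then-vote recovery are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Task:
    name: str
    run: Callable[[], int]  # task body, assumed deterministic when fault-free
    cost: float             # estimated execution cost (hypothetical units)
    risk: float             # hypothetical risk score; higher means more critical


def select_for_replication(tasks: List[Task], budget: float) -> Set[str]:
    """Greedily pick the riskiest tasks whose total cost fits the budget (assumed rule)."""
    chosen: Set[str] = set()
    spent = 0.0
    for task in sorted(tasks, key=lambda t: t.risk, reverse=True):
        if spent + task.cost <= budget:
            chosen.add(task.name)
            spent += task.cost
    return chosen


def run_with_replication(task: Task) -> int:
    """Run the task twice and compare; on a mismatch, a third run breaks the tie."""
    first, second = task.run(), task.run()
    if first == second:
        return first
    third = task.run()
    if third == first or third == second:
        return third
    raise RuntimeError(f"no majority result for task {task.name}")
```

Here `budget` stands in for the extra execution cost one is willing to pay for redundancy. Tasks outside the selected set would run once, without protection against silent errors.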


Pages: 8
