Automatic Risk-based Selective Redundancy for Fault-tolerant Task-parallel HPC Applications

Cited by: 5
Authors
Subasi, Omer [1]
Unsal, Osman [2]
Krishnamoorthy, Sriram [1]
Affiliations
[1] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[2] Barcelona Supercomp Ctr, Barcelona, Spain
Source
PROCEEDINGS OF ESPM2 2017: THIRD INTERNATIONAL WORKSHOP ON EXTREME SCALE PROGRAMMING MODELS AND MIDDLEWARE | 2017
Keywords
Fault-tolerance; selective redundancy; task-parallelism; dataflow
DOI
10.1145/3152041.3152083
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient, and lightweight redundancy mechanism that mitigates both error types by combining partial task replication with checkpointing for task-parallel HPC applications. To avoid the prohibitive cost of complete replication, we introduce a lightweight selective replication mechanism: a fully automatic and transparent heuristic identifies the reliability-critical tasks based on a risk metric, and only those tasks are replicated. Our approach detects and corrects around 70% of silent errors with only 5% average performance overhead. Additionally, the performance overhead of the heuristic itself is negligible.
Pages: 8