Coping with recall and precision of soft error detectors

Cited by: 10
Authors
Bautista-Gomez, Leonardo [1]
Benoit, Anne [2,3]
Cavelan, Aurelien [2,3]
Raina, Saurabh K. [4]
Robert, Yves [2,3,5]
Sun, Hongyang [2,3]
Affiliations
[1] Argonne Natl Lab, Argonne, IL 60439 USA
[2] Ecole Normale Super Lyon, Lyon, France
[3] INRIA, Rocquencourt, France
[4] Jaypee Inst Informat Technol, Noida, Uttar Pradesh, India
[5] Univ Tennessee, Knoxville, TN USA
Keywords
Fault tolerance; High-performance computing; Silent data corruption; Partial verification; Recall and precision; Exascale; FAULT-TOLERANCE; REDUNDANCY;
DOI
10.1016/j.jpdc.2016.07.007
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each method comes with a cost, a recall (the fraction of all errors that are actually detected, which falls as false negatives increase), and a precision (the fraction of true errors among all detected errors, which falls as false positives increase). The main contribution of this paper is to characterize the optimal computing pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We first prove that detectors with imperfect precision offer limited usefulness. We then focus on detectors with perfect precision and conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evaluation scenarios. Extensive simulations illustrate the usefulness of detectors with false negatives, which are available at a lower cost than the guaranteed detectors. (C) 2016 Elsevier Inc. All rights reserved.
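The two detector metrics defined in the abstract can be illustrated with a minimal sketch. The function names and the counts below are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all actual errors that the detector flags."""
    total_errors = true_positives + false_negatives
    if total_errors == 0:
        raise ValueError("recall is undefined when no errors occurred")
    return true_positives / total_errors


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the detector's alarms that correspond to real errors."""
    total_alarms = true_positives + false_positives
    if total_alarms == 0:
        raise ValueError("precision is undefined when no alarm was raised")
    return true_positives / total_alarms


# Hypothetical detector: 10 silent errors occur, 8 are flagged,
# 2 are missed (false negatives), and no false alarm is raised.
r = recall(true_positives=8, false_negatives=2)      # 0.8
p = precision(true_positives=8, false_positives=0)   # 1.0, i.e. perfect precision
```

Under these assumed counts the detector has perfect precision but imperfect recall, which is the class of detector the abstract says the complexity analysis focuses on.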
Pages: 8-24
Page count: 17