Understanding Error Propagation in GPGPU Applications

Cited by: 0
Authors
Li, Guanpeng [1]
Pattabiraman, Karthik [1]
Cher, Chen-Yong [2]
Bose, Pradip [2]
Affiliations
[1] Univ British Columbia, Vancouver, BC, Canada
[2] IBM TJ Watson Res Ctr, New York, NY USA
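Source
SC '16: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2016
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada
Keywords
Fault Injection; Error Resilience; GPGPU; CUDA; Error Propagation; Resilience
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract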
GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics of GPU applications have not been investigated in depth. While error propagation has been extensively investigated for non-GPU applications, GPU applications have a very different programming model which can have a significant effect on error propagation in them. We perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We find GPU applications exhibit significant error propagation for some kinds of errors, but not others, and the behaviour is highly application specific. We observe the GPU-CPU interaction boundary naturally limits error propagation in these applications compared to traditional non-GPU applications. We also formulate various guidelines for the design of fault-tolerance mechanisms in GPU applications based on our results.
Pages: 240-251
Page count: 12
Related papers
50 in total
  • [21] Understanding Logical-Shift Error Propagation in Quanvolutional Neural Networks
    Vallero, Marzio
    Dri, Emanuele
    Giusto, Edoardo
    Montrucchio, Bartolomeo
    Rech, Paolo
    IEEE TRANSACTIONS ON QUANTUM ENGINEERING, 2024, 5 : 1 - 14
  • [22] Analysis of the Impact Factors on Data Error Propagation in HPC applications
    Utrera, Gladys
    Gil, Marisa
    Martorell, Xavier
    2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), 2018, : 546 - 549
  • [23] Tracing Error Propagation in C/C++ Applications
    Kong, Shiyi
    Lu, Minyan
    Li, Luyi
    2018 IEEE 18TH INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C), 2018, : 308 - 315
  • [24] SSAGA: SMs Synthesized for Asymmetric GPGPU Applications
    Saha, Shamik
    Basu, Prabal
    Rajamanikkam, Chidhambaranathan
    Bal, Aatreyi
    Chakraborty, Koushik
    Roy, Sanghamitra
    ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS, 2017, 22 (03)
  • [25] Quantifying the NUMA Behavior of Partitioned GPGPU Applications
    Matz, Alexander
    Froening, Holger
    12TH WORKSHOP ON GENERAL PURPOSE PROCESSING USING GPUS (GPGPU 12), 2019, : 53 - 62
  • [26] MERCATOR: a GPGPU Framework for Irregular Streaming Applications
    Cole, Stephen V.
    Buhler, Jeremy
    2017 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2017, : 727 - 736
  • [27] Soft error resilient QR factorization for hybrid system with GPGPU
    Du, Peng
    Luszczek, Piotr
    Tomov, Stan
    Dongarra, Jack
    JOURNAL OF COMPUTATIONAL SCIENCE, 2013, 4 (06) : 457 - 464
  • [28] GPGPU-Perf: efficient, interval-based DVFS algorithm for mobile GPGPU applications
    SeongKi Kim
    Young J. Kim
    The Visual Computer, 2015, 31 : 1045 - 1054
  • [29] ERROR IN THE PROPAGATION OF ERROR FORMULA
    PARK, SW
    HIMMELBLAU, DM
    AICHE JOURNAL, 1980, 26 (01) : 168 - 170
  • [30] Understanding Calibration and Error Propagation in Longitudinal and Lateral Manganin Gauge Shock Experiments
    J. L. Jordan
    D. T. Casem
    Journal of Dynamic Behavior of Materials, 2021, 7 : 188 - 195