Understanding Error Propagation in GPGPU Applications

Cited by: 0
Authors
Li, Guanpeng [1]
Pattabiraman, Karthik [1]
Cher, Chen-Yong [2]
Bose, Pradip [2]
Affiliations
[1] Univ British Columbia, Vancouver, BC, Canada
[2] IBM TJ Watson Res Ctr, New York, NY USA
Source
SC '16: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2016
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada
Keywords
Fault Injection; Error Resilience; GPGPU; CUDA; Error Propagation; RESILIENCE;
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics of GPU applications have not been investigated in depth. While error propagation has been extensively investigated for non-GPU applications, GPU applications have a very different programming model which can have a significant effect on error propagation in them. We perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We find GPU applications exhibit significant error propagation for some kinds of errors, but not others, and the behaviour is highly application specific. We observe the GPU-CPU interaction boundary naturally limits error propagation in these applications compared to traditional non-GPU applications. We also formulate various guidelines for the design of fault-tolerance mechanisms in GPU applications based on our results.
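The abstract's observation that propagation is highly application specific can be illustrated with a toy experiment. The sketch below is not the authors' compiler-based tool and does not run on a GPU; it is a plain Python illustration under assumed names (`flip_bit`, `scale`, `prefix_sum`, and `corrupted_count` are all hypothetical). It flips one bit in one float32 input and counts how many output elements differ from a fault-free run.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def scale(xs, k=2.0):
    """Element-wise toy kernel: each output depends on exactly one input."""
    return [k * x for x in xs]


def prefix_sum(xs):
    """Toy kernel with a dependence chain: each output depends on all earlier inputs."""
    total, out = 0.0, []
    for x in xs:
        total += x
        out.append(total)
    return out


def corrupted_count(kernel, data, index, bit):
    """Inject a single bit flip into data[index] and count differing outputs."""
    golden = kernel(data)                      # fault-free reference run
    faulty_in = list(data)
    faulty_in[index] = flip_bit(faulty_in[index], bit)
    faulty = kernel(faulty_in)                 # run with the injected fault
    return sum(1 for g, f in zip(golden, faulty) if g != f)


if __name__ == "__main__":
    data = [1.0] * 8
    # Flip the top mantissa bit (bit 22) of element 2, turning 1.0 into 1.5.
    print("scale:", corrupted_count(scale, data, 2, 22))
    print("prefix_sum:", corrupted_count(prefix_sum, data, 2, 22))
```

In this toy setting the same fault corrupts a single output of the element-wise kernel but every output from the fault site onward in the prefix sum, which is one simple way data dependences can shape propagation. The count of differing outputs is only an assumed stand-in for the metrics the paper defines.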
Pages: 240-251
Number of pages: 12
Related papers
50 in total
  • [1] Regional soft error vulnerability and error propagation analysis for GPGPU applications
    Işıl Öz
    Ömer Faruk Karadaş
    The Journal of Supercomputing, 2022, 78 : 4095 - 4130
  • [2] Regional soft error vulnerability and error propagation analysis for GPGPU applications
    Oz, Isil
    Karadas, Omer Faruk
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (03): : 4095 - 4130
  • [3] Evaluating Error Resiliency of GPGPU Applications
    Fang, Bo
    Wei, Jiesheng
    Pattabiraman, Karthik
    Ripeanu, Matei
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 1502 - 1503
  • [4] Evaluating Error Resilience of GPGPU Applications
    Fang, Bo
    Wei, Jiesheng
    Pattabiraman, Karthik
    Ripeanu, Matei
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 1504 - 1504
  • [5] Predicting the Soft Error Vulnerability of GPGPU Applications
    Topcu, Burak
    Oz, Isil
    30TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2022), 2022, : 108 - 115
  • [6] Soft error vulnerability prediction of GPGPU applications
    Burak Topçu
    Işıl Öz
    The Journal of Supercomputing, 2023, 79 : 6965 - 6990
  • [7] Soft error vulnerability prediction of GPGPU applications
    Topcu, Burak
    Oz, Isil
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (06): : 6965 - 6990
  • [8] Pannotia: Understanding Irregular GPGPU Graph Applications
    Che, Shuai
    Beckmann, Bradford M.
    Reinhardt, Steven K.
    Skadron, Kevin
    2013 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2013), 2013, : 185 - +
  • [9] A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications
    Fang, Bo
    Pattabiraman, Karthik
    Ripeanu, Matei
    Gurumurthi, Sudhanva
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2016, 27 (12) : 3397 - 3411
  • [10] Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications
    Li, Guanpeng
    Hari, Siva Kumar Sastry
    Sullivan, Michael
    Tsai, Timothy
    Pattabiraman, Karthik
    Emer, Joel
    Keckler, Stephen W.
    SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2017,