Understanding Error Propagation in GPGPU Applications

Cited by: 0
Authors
Li, Guanpeng [1]
Pattabiraman, Karthik [1]
Cher, Chen-Yong [2]
Bose, Pradip [2]
Affiliations
[1] Univ British Columbia, Vancouver, BC, Canada
[2] IBM TJ Watson Res Ctr, New York, NY USA
Source
SC '16: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2016
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada
Keywords
Fault Injection; Error Resilience; GPGPU; CUDA; Error Propagation; Resilience
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics of GPU applications have not been investigated in depth. While error propagation has been extensively investigated for non-GPU applications, GPU applications have a very different programming model, which can significantly affect how errors propagate. We perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We find that GPU applications exhibit significant error propagation for some kinds of errors, but not others, and that the behaviour is highly application-specific. We observe that the GPU-CPU interaction boundary naturally limits error propagation in these applications compared to traditional non-GPU applications. We also formulate various guidelines for the design of fault-tolerance mechanisms in GPU applications based on our results.
Pages: 240-251
Page count: 12