Predicting the Soft Error Vulnerability of GPGPU Applications

Cited by: 1
Authors
Topcu, Burak [1]
Oz, Isil [1]
Affiliations
[1] Izmir Inst Technol, Comp Engn Dept, Izmir, Turkey
Source
30TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2022) | 2022
DOI
10.1109/PDP55904.2022.00025
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
As Graphics Processing Units (GPUs) have evolved to deliver performance increases for general-purpose computations as well as graphics and multimedia applications, soft error reliability has become an important concern. The soft error vulnerability of applications is evaluated via fault injection experiments. Since fault injection takes an impractically long time to cover the fault locations in complex GPU hardware structures, prediction-based techniques have been proposed to evaluate the soft error vulnerability of General-Purpose GPU (GPGPU) programs based on hardware performance characteristics. In this work, we propose ML-based prediction models for the soft error vulnerability evaluation of GPGPU programs. We consider both program characteristics and hardware performance metrics collected from either simulation or profiling tools. While we use regression models to predict the masked fault rates, we build classification models to specify the vulnerability level of the programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 96.6%, 82.6%, and 87% for masked fault rates, SDCs, and crashes, respectively.
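The three quantities named in the abstract (masked fault, SDC, and crash rates) are the usual summary of a fault injection campaign. The sketch below is illustrative only and is not taken from the paper: it shows one way such rates could be computed from a list of per-injection outcomes, and how a rate might be binned into a class label for a classification model. The function names, the two-level "low"/"high" scheme, the 0.1 threshold, and the sample data are all hypothetical assumptions.

```python
from collections import Counter

# Outcome labels assumed for each fault injection run (hypothetical naming).
OUTCOMES = ("masked", "sdc", "crash")


def outcome_rates(outcomes):
    """Return the fraction of injections that ended in each outcome."""
    if not outcomes:
        raise ValueError("need at least one fault injection result")
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}


def vulnerability_level(rate, threshold=0.1):
    """Bin a rate into a coarse class label.

    The two-level scheme and the default threshold are assumptions made
    for illustration; the paper's actual class boundaries are not given
    in the abstract.
    """
    return "high" if rate >= threshold else "low"


# Made-up campaign: 10 injections into one hypothetical kernel.
runs = ["masked"] * 7 + ["sdc"] * 2 + ["crash"] * 1
rates = outcome_rates(runs)
levels = {name: vulnerability_level(rates[name]) for name in ("sdc", "crash")}
```

Under these assumptions, the masked rate would serve as a regression target, while the binned SDC and crash levels would serve as classification targets.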
Pages: 108-115
Page count: 8