Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units

Cited by: 81
Authors
Goncalves de Oliveira, Daniel Alfonso [1]
Pilla, Laercio Lima [2]
Santini, Thiago [1]
Rech, Paolo [1]
Affiliations
[1] Fed Univ Rio Grande Sul UFRGS, Inst Informat, Av Bento Goncalves,9500 Campus Vale Bloco 4, Porto Alegre, RS, Brazil
[2] Univ Fed Santa Catarina, Dept Informat & Stat, Campus Univ Reitor Joao David Ferreira Lima, Florianopolis, SC, Brazil
Keywords
GPU; fault-tolerance; neutron sensitivity; parallel processors; reliability; PERFORMANCE; TOLERANCE;
DOI
10.1109/TC.2015.2444855
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for both the automotive and aerospace markets and is also becoming an issue for supercomputers. In fact, the high number of devices in large data centers makes it very likely that at least one device is corrupted. In this paper, we aim to give novel insights into GPU reliability by evaluating the neutron sensitivity of the memory structures of modern GPUs, highlighting pattern dependence and the occurrence of multiple errors. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure the operative error rates of GPUs. From experimental data and algorithm analysis we derive general insights into the reliability of parallel algorithms and programming approaches. Finally, three hardening strategies (error-correcting code, algorithm-based fault tolerance, and duplication with comparison) are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
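As an illustration of one of the hardening strategies named in the abstract, the following is a minimal sketch of checksum-based algorithm-based fault tolerance (ABFT) for matrix multiplication. It is not the paper's implementation: it runs on the CPU in pure Python, all function names are hypothetical, and the injected error simply emulates a corrupted output element. It assumes integer matrices so that checksums compare exactly; floating-point data would need a tolerance.

```python
# Illustrative ABFT sketch for matrix multiplication (pure Python, no GPU).
# All names are hypothetical and chosen for this example only.

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b one extra element holding that row's sum."""
    return [row + [sum(row)] for row in b]

def find_mismatches(c_full):
    """Return (bad_rows, bad_cols): indices whose recomputed sum
    disagrees with the checksum stored in the last column or last row."""
    n_rows = len(c_full) - 1
    n_cols = len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    return bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the checksum-extended inputs carries its own checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = find_mismatches(c_full)

# Emulate a soft error in one output element (assumed single corruption).
c_full[0][1] += 16
corrupted = find_mismatches(c_full)

print(clean)      # no mismatches expected on the fault-free result
print(corrupted)  # row and column index of the corrupted element
```

With a single corrupted element, the intersection of the failing row and failing column locates the error, and the checksum difference gives the value needed to correct it. Multiple errors in the same row or column can defeat this simple scheme.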
Pages: 791-804
Page count: 14