On the Efficacy of ECC and the Benefits of FinFET Transistor Layout for GPU Reliability

Cited by: 28
Authors
Lunardi, Caio [1]
Previlon, Fritz [2]
Kaeli, David [2]
Rech, Paolo [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, BR-91501970 Porto Alegre, RS, Brazil
[2] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA
Funding
European Union Horizon 2020
Keywords
Error correction codes (ECCs); parallel architectures; reliability; error rates; soft errors; vulnerability; tolerance
DOI
10.1109/TNS.2018.2823786
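Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract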
Using error-correcting codes (ECCs) is considered one of the most effective ways to mask the effects of radiation-induced faults in memory and computing devices. Unfortunately, with the increased complexity of modern processors, there is a growing amount of hidden logic and memory resources, such as flip-flops in internal pipelines and queues, that cannot be easily protected by ECC. In this paper, we experimentally investigate the efficacy of using ECC to mask neutron-induced faults in modern graphics processing units (GPUs). In our analysis, we consider GPUs fabricated in CMOS and FinFET technologies. We show that changes in transistor technology can be as beneficial as using ECC for reducing silent data corruption rates. Finally, we compare fault-injection results, as carried out both on internal registers and at an instruction level, to better understand the effectiveness of ECC.
Pages: 1843-1850
Page count: 8