Impact of Tensor Cores and Mixed Precision on the Reliability of Matrix Multiplication in GPUs

Cited by: 24
Authors
Basso, Pedro Martins [1]
dos Santos, Fernando Fernandes [1]
Rech, Paolo [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, BR-91509900 Porto Alegre, RS, Brazil
Keywords
Tensile stress; Reliability; Graphics processing units; Error correction codes; Computer architecture; Object detection; Kernel; Graphics processing unit (GPU); matrix multiplication (MxM); neutrons; reliability; soft errors; tensor core
DOI
10.1109/TNS.2020.2977583
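Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract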
Matrix multiplication (MxM) is a cornerstone application for both high-performance computing and safety-critical applications. In fact, most of the operations in convolutional neural networks for object detection are MxM related. Chip designers are proposing novel solutions to improve the efficiency of MxM execution. In this article, we investigate the impact of two novel architectures for MxM (i.e., tensor cores and mixed precision) on the reliability of graphics processing units (GPUs). In addition, we evaluate how effective the embedded error-correcting code is in reducing the MxM error rate. Our results show that low-precision operations are more reliable and that tensor cores increase the amount of data correctly produced by the GPU. However, reducing precision and using tensor cores significantly increase the impact of faults on output correctness.
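The abstract's closing observation, that lower precision increases the impact of a fault on the output, can be illustrated with a small software experiment. The sketch below is not the authors' methodology and reproduces none of their measurements. It is a minimal illustration under stated assumptions: a fault is modeled as a single bit flip in one input element, and the helper names (`flip_bit`, `matmul`, `corrupted_outputs`) are hypothetical. It shows two generic effects: flipping the same mantissa bit perturbs a half-precision value far more than a single-precision one, and one corrupted input element of MxM propagates to a whole row of the output.

```python
import struct

# (float format, same-width unsigned integer format) for the struct module
FP16 = ("e", "H")  # IEEE 754 binary16: 10 mantissa bits
FP32 = ("f", "I")  # IEEE 754 binary32: 23 mantissa bits


def flip_bit(value, bit, fmt):
    """Flip one bit in the IEEE 754 encoding of `value` and decode the result."""
    float_code, int_code = fmt
    (raw,) = struct.unpack("<" + int_code, struct.pack("<" + float_code, value))
    (faulty,) = struct.unpack("<" + float_code,
                              struct.pack("<" + int_code, raw ^ (1 << bit)))
    return faulty


def matmul(a, b):
    """Plain triple-loop matrix multiplication on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def corrupted_outputs(a, b, row, col, bit, fmt):
    """Count output elements that change when one bit of a[row][col] is flipped."""
    golden = matmul(a, b)
    faulty_a = [r[:] for r in a]  # copy so the fault-free input is untouched
    faulty_a[row][col] = flip_bit(a[row][col], bit, fmt)
    faulty = matmul(faulty_a, b)
    return sum(g != f
               for g_row, f_row in zip(golden, faulty)
               for g, f in zip(g_row, f_row))


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 1.0], [1.0, 1.0]]

# Flip the least significant mantissa bit of the value 1.0 in each format.
err16 = abs(flip_bit(1.0, 0, FP16) - 1.0)  # 2**-10
err32 = abs(flip_bit(1.0, 0, FP32) - 1.0)  # 2**-23

# One faulty element in row 0 of `a` corrupts every element of output row 0.
n_bad = corrupted_outputs(a, b, 0, 0, 0, FP16)  # 2
```

In this toy setting the half-precision perturbation is 2^13 times larger than the single-precision one for the same bit position. Real radiation-induced faults can strike any bit and any hardware structure, so this shows only one possible mechanism.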
Pages: 1560-1565
Number of pages: 6