Warped-DMR: Light-weight Error Detection for GPGPU

Cited by: 39
Authors
Jeon, Hyeran [1]
Annavaram, Murali [1]
Affiliations
[1] Univ So Calif, Los Angeles, CA 90089 USA
Source
2012 IEEE/ACM 45TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO-45) | 2012
Keywords
GPU;
DOI
10.1109/MICRO.2012.13
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
General purpose graphics processing units (GPGPUs) are feature-rich GPUs that provide general purpose computing ability with a massive number of parallel threads. This massive parallelism, combined with programmability, has made GPGPUs the most attractive choice in supercomputing centers. Unsurprisingly, most GPGPU-based studies have focused on improving performance by leveraging the GPGPU's high degree of parallelism. However, for many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors could corrupt results and potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light-weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection operates both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
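The dual-modular idea in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' hardware design: the function `dmr_execute`, the `fault_lane` bit-flip fault model, and the rule of borrowing idle lanes for the spatial check (falling back to a later re-execution on a neighbouring lane for the temporal check) are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Tuple


def dmr_execute(
    op: Callable[[int], int],
    operands: List[int],
    active: List[bool],
    fault_lane: Optional[int] = None,
) -> Tuple[List[Optional[int]], List[int], List[int]]:
    """Run `op` on every active lane and check each result with a second run.

    Returns (results, spatially_checked_lanes, mismatching_lanes).
    All names and the fault model here are illustrative assumptions.
    """
    n = len(operands)

    def run(lane: int, x: int) -> int:
        # Assumed fault model: a faulty physical lane flips the low result bit.
        y = op(x)
        return y ^ 1 if lane == fault_lane else y

    results: List[Optional[int]] = [None] * n
    idle = [i for i in range(n) if not active[i]]
    spatial: List[int] = []
    mismatches: List[int] = []

    for lane in range(n):
        if not active[lane]:
            continue
        results[lane] = run(lane, operands[lane])
        if idle:
            # Spatial check: an otherwise idle lane repeats the computation.
            checker = idle.pop()
            spatial.append(lane)
        else:
            # Temporal check: repeat later on a different lane, so a fault
            # tied to one lane does not corrupt both copies identically.
            checker = (lane + 1) % n
        if run(checker, operands[lane]) != results[lane]:
            mismatches.append(lane)

    return results, spatial, mismatches
```

As with any dual-modular scheme, a mismatch in this model only signals that one of the two copies is wrong; it does not say which lane is faulty.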
Pages: 37-47
Number of pages: 11