Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications

Cited by: 3
Authors
Lunardi, Caio [1]
Quinn, Heather [2]
Monroe, Laura [2]
Oliveira, Daniel [1]
Navaux, Philippe [1]
Rech, Paolo [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, BR-91509900 Porto Alegre, RS, Brazil
[2] Los Alamos Natl Lab, Los Alamos, NM 87545 USA
Funding
European Union Horizon 2020
Keywords
Graphics processing unit; neutron sensitivity; reliability; software fault injection; sorting algorithms; SOFT ERRORS; METHODOLOGY; RATES;
DOI
10.1109/TNS.2017.2727499
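Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline classification codes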
0808 ; 0809 ;
Abstract
In this paper, we investigate neutron-induced errors in three implementations of sort algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate of the sort algorithms using the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array and values to be misaligned, as well as application crashes and system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, we performed an extensive fault-injection campaign to better understand the observed phenomena. We take advantage of the SASS-assembly Instrumentor Fault Injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We identify the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
Pages: 2169-2178
Page count: 10