Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications

Cited by: 3
Authors
Lunardi, Caio [1]
Quinn, Heather [2]
Monroe, Laura [2]
Oliveira, Daniel [1]
Navaux, Philippe [1]
Rech, Paolo [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, BR-91509900 Porto Alegre, RS, Brazil
[2] Los Alamos Natl Lab, Los Alamos, NM 87545 USA
Funding
EU Horizon 2020
Keywords
Graphics processing unit; neutron sensitivity; reliability; software fault injection; sorting algorithms; SOFT ERRORS; METHODOLOGY; RATES;
DOI
10.1109/TNS.2017.2727499
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808 ; 0809 ;
Abstract
In this paper, we investigate neutron-induced errors in three implementations of sort algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate of the sort algorithms using the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array, misaligned values, and application crashes or system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, we performed an extensive fault-injection campaign, which allows for a better understanding of the observed phenomena. We take advantage of the SASS-assembly Instrumentor Fault Injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We identify the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
Pages: 2169-2178
Number of pages: 10