An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications

Cited by: 26
Authors
Di, Sheng [1]
Berrocal, Eduardo [2]
Cappello, Franck [1,3]
Affiliations
[1] Argonne Natl Lab, Argonne, IL 60439 USA
[2] IIT, Chicago, IL 60616 USA
[3] Univ Illinois, Urbana, IL 61801 USA
Source
2015 15TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING | 2015
Keywords
DOI
10.1109/CCGrid.2015.17
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The silent data corruption (SDC) problem is attracting increasing attention because it is expected to have a great impact on exascale HPC applications. SDC faults are hazardous in that they pass unnoticed by hardware and can lead to wrong computation results. In this work, we formulate SDC detection as a runtime one-step-ahead prediction method, leveraging multiple linear prediction methods to improve the detection results. The contributions are twofold: (1) we propose an error feedback control model that can reduce the prediction errors for different linear prediction methods, and (2) we propose a spatial-data-based even-sampling method to minimize the detection overheads (including memory and computation cost). We implement our algorithms in the Fault Tolerance Interface (FTI), a fault tolerance library with multiple checkpoint levels, such that users can conveniently protect their HPC applications against both SDC errors and fail-stop errors. We evaluate our approach by using large-scale traces from well-known, large-scale HPC applications, as well as by running those HPC applications in a real cluster environment. Experiments show that our error feedback control model can improve detection sensitivity by 34-189% for bit-flip memory errors injected with the bit positions in the range [20,30], without any degradation in detection accuracy. Furthermore, memory size can be reduced by 33% with our spatial-data even-sampling method, with only a slight and graceful degradation in the detection sensitivity.
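Illustrative sketch (not from the paper): the abstract does not give the detector's equations, so the following Python fragment only shows the general idea of one-step-ahead prediction with an error-feedback correction. The function name `detect_sdc`, the choice of two-point linear extrapolation as the predictor, the fixed detection `radius`, and the rule of replacing a flagged value with its prediction are all assumptions made for this example, not the authors' implementation. Even sampling is not shown.

```python
def detect_sdc(values, radius, use_feedback=True):
    """Flag indices whose observed value falls outside predicted +/- radius.

    Hypothetical sketch: predictor, radius rule, and recovery policy are
    assumptions for illustration, not the method evaluated in the paper.
    """
    history = list(values[:2])  # assume the first two samples are clean
    flagged = []
    last_error = 0.0            # most recent error of the raw predictor
    for t in range(2, len(values)):
        # Raw one-step-ahead prediction: linear extrapolation.
        base = 2.0 * history[-1] - history[-2]
        # Error feedback: shift the prediction by the last observed error.
        predicted = base + (last_error if use_feedback else 0.0)
        if abs(values[t] - predicted) > radius:
            flagged.append(t)
            history.append(predicted)   # do not trust the suspicious value
        else:
            last_error = values[t] - base
            history.append(values[t])
    return flagged


# Smooth (quadratic) time series for one data point, with one corrupted step.
clean = [0.01 * t * t for t in range(20)]
corrupted = list(clean)
corrupted[10] += 1.0                    # stands in for a high-order bit flip
print(detect_sdc(clean, radius=0.05))
print(detect_sdc(corrupted, radius=0.05))
```

In this toy setting the feedback term absorbs the constant curvature that linear extrapolation misses, so the prediction stays close to the clean data and only the corrupted step exceeds the radius.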
Pages: 271-280
Page count: 10
Related papers
20 in total
[1] [Anonymous], 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
[2] Bautista-Gomez L., 2011, P 2011 INT C HIGH PE, DOI 10.1145/2063384.2063427
[3] Benson A.R., 2014, INT J HIGH IN PRESS
[4] Berrocal E., LIGHTWEIGHT SILENT D
[5] CHALERMARREWONG T, 2012, 18 P INT C PAR DISTR, P794, DOI 10.1109/ICPADS.2012.129
[6] Chen Z., 2013, 18 ACM SIGPLAN S PRI, P167
[7] Application-level fault tolerance in the orbital thermal imaging spectrometer [J]. Ciocca, E; Koren, I; Koren, Z; Krishna, CM; Katz, DS. 10TH IEEE PACIFIC RIM INTERNATIONAL SYMPOSIUM ON DEPENDABLE COMPUTING, PROCEEDINGS, 2004: 43-48
[8] Di S., SC 14
[9] Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales [J]. Di, Sheng; Bautista-Gomez, Leonardo; Cappello, Franck. SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2014: 907-918
[10] Feinberg A., 2013, 83000 PROCESSOR SUPE