Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Cited by: 30
Authors
Di, Sheng [1]
Cappello, Franck [1]
Affiliations
[1] Argonne Natl Lab, Math & Comp Sci MCS Div, Lemont, IL 60439 USA
Keywords
Fault tolerance; silent data corruption; exascale HPC
DOI
10.1109/TPDS.2016.2517639
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because there is no indication that there are errors during the execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss the runtime data features, as well as the impact of the SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges for guaranteeing low false alarms. Experiments show that our detector can detect 80-99.99 percent of SDCs with a false alarm rate less than 1 percent of iterations for most cases. The memory cost and detection overhead are reduced to 15 and 6.3 percent, respectively, for a large majority of applications.
Pages: 2809-2823
Number of pages: 15
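Illustrative sketch
The abstract describes detecting only influential SDCs by predicting each data point and tuning the detection range from local runtime data. The record does not give the paper's actual prediction model or tuning rule, so the following is only a minimal sketch of that general idea under stated assumptions: a linear extrapolation predictor, a detection radius equal to a user-chosen impact bound plus a margin times the largest recent prediction error, and hypothetical names (`RangeDetector`, `impact_bound`, `window`, `margin`) that do not come from the paper.

```python
from collections import deque


class RangeDetector:
    """Hypothetical one-variable detector; not the paper's implementation."""

    def __init__(self, impact_bound, window=8, margin=1.5):
        self.impact_bound = impact_bound    # deviation the user can tolerate (assumed parameter)
        self.margin = margin                # safety factor on recent prediction errors (assumed)
        self.history = deque(maxlen=2)      # last two trusted values
        self.errors = deque(maxlen=window)  # recent prediction errors (local runtime data)

    def check(self, value):
        """Return True if `value` falls outside the current detection range."""
        if len(self.history) < 2:
            self.history.append(value)      # not enough data to predict yet
            return False
        prev2, prev1 = self.history
        predicted = 2 * prev1 - prev2       # linear extrapolation (assumed predictor)
        error = abs(value - predicted)
        # The range widens when recent predictions were poor, which limits false alarms.
        radius = self.impact_bound + self.margin * max(self.errors, default=0.0)
        suspicious = error > radius
        if suspicious:
            self.history.append(predicted)  # do not learn from a suspected corruption
        else:
            self.errors.append(error)
            self.history.append(value)
        return suspicious


detector = RangeDetector(impact_bound=0.01)
series = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 0.6]  # 5.0 mimics a high-order bit flip
flags = [detector.check(v) for v in series]
```

In this sketch a deviation smaller than `impact_bound` is deliberately not reported, which mirrors the abstract's point that the detector targets corruptions large enough to affect the results rather than maximizing prediction accuracy.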