LADR: Low-cost Application-level Detector for Reducing Silent Output Corruptions

Cited by: 14
Authors
Chen, Chao [1]
Eisenhauer, Greg [1]
Wolf, Matthew [2]
Pande, Santosh [1]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Oak Ridge Natl Lab, Oak Ridge, TN USA
Source
HPDC '18: PROCEEDINGS OF THE 27TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING | 2018
Keywords
Resiliency; Transient Fault; Soft Error; Silent Data Corruption; Fault Tolerance; Exascale Computing;
DOI
10.1145/3208040.3208043
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Applications running on future high performance computing (HPC) systems are more likely to experience transient faults because of technology scaling trends: higher circuit density, smaller transistor size, and near-threshold voltage (NTV) operation. A transient fault can corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs). In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with four scientific workloads, and the results show that LADR achieved more than 80% fault coverage with only ~3% runtime overhead and ~1% memory overhead. Compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage while reducing runtime overheads by 21% to 75% and memory overheads by 35% to 55% for the evaluated workloads. We believe that this combination of low memory and runtime overheads with attractive detection precision makes LADR a viable approach for assuring correct output from large-scale high performance simulations.
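To illustrate the general idea of watching a state variable for data anomalies, here is a minimal sketch. It is not LADR's algorithm: the linear-extrapolation predictor, the fixed `tolerance`, and the names `predict_next` and `find_anomalies` are all assumptions made for illustration, and the compile-time data-flow analysis that selects which variables to monitor is not modeled.

```python
from typing import List, Sequence


def predict_next(prev2: Sequence[float], prev1: Sequence[float]) -> List[float]:
    """Hypothetical predictor: linear extrapolation from the two previous time steps."""
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


def find_anomalies(
    prev2: Sequence[float],
    prev1: Sequence[float],
    current: Sequence[float],
    tolerance: float,
) -> List[int]:
    """Return indices where the observed value is farther than `tolerance`
    from the extrapolated value (an assumed, fixed threshold)."""
    predicted = predict_next(prev2, prev1)
    return [
        i
        for i, (p, c) in enumerate(zip(predicted, current))
        if abs(c - p) > tolerance
    ]


# A smoothly evolving state variable over three time steps.
step0 = [1.0, 2.0, 3.0, 4.0]
step1 = [1.1, 2.1, 3.1, 4.1]
# Simulated corruption: element 2 jumps far from its expected trajectory.
step2 = [1.2, 2.2, 3.2 + 1024.0, 4.2]

flagged = find_anomalies(step0, step1, step2, tolerance=0.5)
print(flagged)
```

In a sketch like this, the tolerance trades fault coverage against false positives: a tighter value catches smaller corruptions but flags more legitimate changes in the data.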
Pages: 156-167
Page count: 12