Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach

Cited by: 17
Authors
Li, Dong [1]
Chen, Zizhong [2]
Wu, Panruo [2]
Vetter, Jeffrey S. [1, 3]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Univ Calif Riverside, Riverside, CA 92521 USA
[3] Georgia Inst Technol, Atlanta, GA 30332 USA
Source
2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2013
Keywords
algorithm-based fault tolerance; error-correcting code; adaptive resilience;
DOI
10.1145/2503210.2503226
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of improving performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% for system energy (and up to 40% for dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
Pages: 12