A System Software Approach to Proactive Memory-Error Avoidance

Cited by: 30
Authors
Costa, Carlos H. A. [1]
Park, Yoonho [1]
Rosenburg, Bryan S. [1]
Cher, Chen-Yong [1]
Ryu, Kyung Dong [1]
Affiliation
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2014
Keywords
Operating Systems; Memory Structures; Reliability and Fault-Tolerance
DOI
10.1109/SC.2014.63
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and offlines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.
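The core idea in the abstract (correctable-error reports reach the OS, which migrates data away from suspect memory and takes that memory offline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `ProactivePageRetirer`, the fixed per-frame threshold, and the frame numbers are hypothetical, and the abstract does not state what prediction rule or kernel interfaces the authors actually use.

```python
from collections import defaultdict


class ProactivePageRetirer:
    """Hypothetical policy: retire a page frame once its correctable-error
    (CE) count reaches a fixed threshold. The real predictor in the paper
    is not described in the abstract; this threshold rule is a stand-in."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.ce_counts = defaultdict(int)  # frame -> CEs observed so far
        self.offlined = set()              # frames no longer handed out
        self.migrations = []               # (old_frame, new_frame) pairs

    def report_correctable_error(self, frame, free_frames):
        """Record one CE on `frame`. Returns the frame the data was
        migrated to, or None if no action was taken."""
        if frame in self.offlined:
            return None  # already retired, nothing left to protect
        self.ce_counts[frame] += 1
        if self.ce_counts[frame] < self.threshold:
            return None  # not yet considered unhealthy
        # Choose the first healthy free frame as the migration target.
        candidates = [f for f in free_frames
                      if f not in self.offlined and f != frame]
        if not candidates:
            return None  # no safe destination; leave the page in place
        target = candidates[0]
        self.migrations.append((frame, target))  # stands in for page migration
        self.offlined.add(frame)                 # stands in for offlining
        return target
```

In this sketch a frame is left in place until the threshold is reached, so the choice of threshold trades early retirement of healthy memory against the risk of an uncorrectable error arriving first.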
Pages: 707-718
Page count: 12