Design and Modeling of a Non-blocking Checkpointing System

Cited by: 0
Authors
Sato, Kento [1]
Mohror, Kathryn [2]
Moody, Adam [2]
Gamblin, Todd [2]
de Supinski, Bronis R. [2]
Maruyama, Naoya [3]
Matsuoka, Satoshi [4]
Affiliations
[1] Tokyo Inst Technol, Dep Math & Comp Sci, Meguro Ku, 2-12-1-W8-33 Ohokayama, Tokyo 1528552, Japan
[2] Ctr Appl Sci Comp, Lawrence Livermore Natl Lab, Livermore, CA 94551 USA
[3] RIKEN, Adv Inst Computat Sci, Chuo Ku, Kobe, Hyogo 6500047, Japan
[4] Tokyo Inst Technol, Global Sci Informat & Comp Ctr, Meguro Ku, Tokyo 1528552, Japan
Source
2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2012
Keywords
Fault tolerance; Checkpoint/Restart; Markov model
DOI
Not available
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
Pages: 10