Hybrid Checkpointing Using Emerging Nonvolatile Memories for Future Exascale Systems

Cited by: 45
Authors
Dong, Xiangyu [1]
Xie, Yuan [1]
Muralimanohar, Naveen
Jouppi, Norman P.
Affiliations
[1] Pennsylvania State University, Computer Science and Engineering Department, University Park, PA 16802, USA
Funding
National Science Foundation (USA)
Keywords
Design; Performance; Reliability; Checkpoint; petascale; exascale; phase-change memory; optimum checkpoint model; hybrid checkpoint; in-memory checkpoint; in-disk checkpoint; incremental checkpoint; background checkpoint; checkpoint prototype; CHALLENGES;
DOI
10.1145/1970386.1970387
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The scalability of future Massively Parallel Processing (MPP) systems is being severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing results in an overhead of 25% or more at petascale. Because systems become more vulnerable as the node count keeps increasing, novel techniques that enable fast and frequent checkpointing are critical to implementing future exascale systems. In this work, we first introduce one of the emerging nonvolatile memory technologies, Phase-Change Random Access Memory (PCRAM), as a suitable candidate for a fast checkpointing device. After a thorough analysis of MPP systems, failure rates, and failure sources, we propose a PCRAM-based hybrid local/global checkpointing mechanism that not only provides faster checkpoint storage but also boosts the effectiveness of other orthogonal techniques such as incremental checkpointing and background checkpointing. Three variant implementations of PCRAM-based hybrid checkpointing are designed to be adopted at different stages and to offer a smooth transition from conventional in-disk checkpointing to the instant in-memory approach. Analyzing the overhead with a hybrid checkpointing performance model, we show that the proposed approach incurs less than 3% performance overhead on a projected exascale system.
Pages: 29