Hybrid Checkpointing Using Emerging Nonvolatile Memories for Future Exascale Systems

Cited by: 44
Authors
Dong, Xiangyu [1]
Xie, Yuan [1]
Muralimanohar, Naveen
Jouppi, Norman P.
Affiliations
[1] Penn State Univ, Comp Sci & Engn Dept, University Pk, PA 16802 USA
Funding
National Science Foundation (USA)
Keywords
Design; Performance; Reliability; Checkpoint; petascale; exascale; phase-change memory; optimum checkpoint model; hybrid checkpoint; in-memory checkpoint; in-disk checkpoint; incremental checkpoint; background checkpoint; checkpoint prototype; CHALLENGES;
DOI
10.1145/1970386.1970387
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The scalability of future Massively Parallel Processing (MPP) systems is severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing incurs an overhead of 25% or more at petascale. Because systems become more vulnerable as node counts keep increasing, novel techniques that enable fast and frequent checkpointing are critical to implementing future exascale systems. In this work, we first introduce an emerging nonvolatile memory technology, Phase-Change Random Access Memory (PCRAM), as a suitable candidate for a fast checkpointing device. After a thorough analysis of MPP systems, their failure rates, and their failure sources, we propose a PCRAM-based hybrid local/global checkpointing mechanism that not only provides faster checkpoint storage but also boosts the effectiveness of orthogonal techniques such as incremental checkpointing and background checkpointing. Three variant implementations of PCRAM-based hybrid checkpointing are designed for adoption at different stages, offering a smooth transition from conventional in-disk checkpointing to the instant in-memory approach. Using a hybrid checkpointing performance model to analyze the overhead, we show that the proposed approach incurs less than 3% performance overhead on a projected exascale system.
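To illustrate why faster checkpoint storage lowers overhead, the following is a minimal sketch based on Young's classic first-order approximation of the optimum checkpoint interval. It is not the hybrid performance model of the paper, which the abstract does not spell out. The function names and the numeric inputs (checkpoint times and mean time between failures) are hypothetical values chosen only for illustration.

```python
import math


def young_interval(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum compute interval between checkpoints.

    tau = sqrt(2 * delta * M), where delta is the time to write one
    checkpoint and M is the mean time between failures (MTBF).
    """
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def approx_overhead(checkpoint_time_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost at the optimum interval.

    Two terms: time spent writing checkpoints (delta / tau) and expected
    rework after a failure (about half an interval per MTBF, tau / (2 * M)).
    This is a textbook approximation, not the paper's model.
    """
    tau = young_interval(checkpoint_time_s, mtbf_s)
    return checkpoint_time_s / tau + tau / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical system MTBF in seconds
    for label, delta in [("slow checkpoint", 50.0), ("fast checkpoint", 0.5)]:
        tau = young_interval(delta, mtbf)
        print(f"{label}: interval {tau:.0f} s, overhead {approx_overhead(delta, mtbf):.1%}")
```

Under this approximation the overhead at the optimum interval scales as sqrt(2 * delta / M), so cutting the checkpoint time by a factor of 100 cuts the estimated overhead by a factor of 10.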
Pages: 29