Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

Cited by: 6
Authors
Keller, Kai [1]
Gomez, Leonardo Bautista [1]
Affiliations
[1] Barcelona Supercomp Ctr BSC CNS, Barcelona, Spain
Source
2019 19TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID) | 2019
Funding
European Union Horizon 2020
Keywords
Fault Tolerance; Differential Checkpointing; Incremental Checkpointing; Multilevel Checkpoint
DOI
10.1109/CCGRID.2019.00015
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by only writing data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyze them regarding their potential of reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time.
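
The hash-based detection of dirty blocks mentioned in the abstract can be illustrated with a minimal sketch. This record does not describe the paper's file format, block size, or hash function, so the example assumes fixed-size blocks and SHA-256, and the names `BLOCK_SIZE`, `block_hashes`, and `dirty_blocks` are hypothetical. It is not the authors' implementation. Blocks beyond the previous checkpoint's length count as dirty, which is one simple way to handle a dataset that has grown.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes, chosen for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def dirty_blocks(data: bytes, previous_hashes: list, block_size: int = BLOCK_SIZE):
    """Compare the current block digests with those of the last checkpoint.

    Returns (indices of blocks that changed or are new, current digests).
    """
    current = block_hashes(data, block_size)
    dirty = [
        i for i, digest in enumerate(current)
        if i >= len(previous_hashes) or digest != previous_hashes[i]
    ]
    return dirty, current


# First checkpoint: no previous hashes, so every block is written.
data = bytes(3 * BLOCK_SIZE)
first_dirty, hashes = dirty_blocks(data, [])

# Change one byte in the second block: only that block is dirty.
modified = bytearray(data)
modified[BLOCK_SIZE + 10] = 1
second_dirty, hashes = dirty_blocks(bytes(modified), hashes)

# Grow the dataset by one byte: only the new trailing block is dirty.
grown = bytes(modified) + b"x"
third_dirty, hashes = dirty_blocks(grown, hashes)
```

Only the blocks listed as dirty would need to be written at each checkpoint, which is where the I/O reduction comes from.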
Pages: 52-61
Page count: 10