VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Cited by: 65
Authors
Nicolae, Bogdan [1]
Moody, Adam [2]
Gonsiorowski, Elsa [2]
Mohror, Kathryn [2]
Cappello, Franck [1]
Affiliations
[1] Argonne Natl Lab, 9700 S Cass Ave, Argonne, IL 60439 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
Source
2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019) | 2019
Keywords
parallel I/O; checkpoint-restart; immutable data; adaptive multilevel asynchronous I/O
DOI
10.1109/IPDPS.2019.00099
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
Pages: 911-920
Page count: 10