Checkpoint/restart approaches for a thread-based MPI runtime

Cited by: 7

Authors:

Adam, Julien [1]

Kermarquer, Maxime [2]

Besnard, Jean-Baptiste [1]

Bautista-Gomez, Leonardo [3]

Perache, Marc [2]

Carribault, Patrick [2]

Jaeger, Julien [2]

Malony, Allen D. [4]

Shende, Sameer [4]

Affiliations:

[1] ParaTools SAS, Bruyeres Le Chatel, France

[2] CEA, DAM, DIF, F-91297 Arpajon, France

[3] Barcelona Supercomp Ctr, Barcelona, Spain

[4] ParaTools Inc, Eugene, OR USA

Source:

PARALLEL COMPUTING | 2019 / Vol. 85

Funding:

European Union Horizon 2020;

Keywords:

Checkpoint-restart; Fault-tolerance; DMTCP; Infiniband; Multilevel checkpointing; MPI oversubscribing;

DOI:

10.1016/j.parco.2019.02.006

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to a dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed. (C) 2019 Elsevier B.V. All rights reserved.

Citation

Pages: 204-219

Page count: 16

40 references in total

[31] Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm [J].

Ni, Xiang ;

Meneses, Esteban ;

Kale, Laxmikant V. .

2012 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2012, :364-372

[32]

Ni Xiang, 2013, P INT C HIGH PERF CO, P7

[33]

Perache Marc, 2008, Euro-Par 2008 Parallel Processing. 14th International Euro-Par Conference, P78, DOI 10.1007/978-3-540-85451-7_9

[34]

Pérache M, 2009, LECT NOTES COMPUT SC, V5759, P94, DOI 10.1007/978-3-642-03770-2_16

[35] Diskless checkpointing [J].

Plank, JS ;

Li, K ;

Puening, MA .

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1998, 9 (10) :972-986

[36]

Rieker M., 2006, PROC PDPTA 06, P492

[37]

Teranishi Keita, 2014, P 21 EUR MPI US GROU, P51

[38] Scheduling parallel jobs on multicore clusters using CPU oversubscription [J].

Utrera, Gladys ;

Corbalan, Julita ;

Labarta, Jesus .

JOURNAL OF SUPERCOMPUTING, 2014, 68 (03) :1113-1140

[39]

Wende Florian., 2015, Proceedings of the 3rd International Conference on Exascale Applications and Software, EASC'15, P13

[40] FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI [J].

Zheng, GB ;

Shi, LX ;

Kalé, LV .

2004 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2004, :93-103
