Checkpoint/restart approaches for a thread-based MPI runtime

Cited by: 7
Authors
Adam, Julien [1 ]
Kermarquer, Maxime [2 ]
Besnard, Jean-Baptiste [1 ]
Bautista-Gomez, Leonardo [3 ]
Perache, Marc [2 ]
Carribault, Patrick [2 ]
Jaeger, Julien [2 ]
Malony, Allen D. [4 ]
Shende, Sameer [4 ]
Affiliations
[1] ParaTools SAS, Bruyeres Le Chatel, France
[2] CEA, DAM, DIF, F-91297 Arpajon, France
[3] Barcelona Supercomp Ctr, Barcelona, Spain
[4] ParaTools Inc, Eugene, OR USA
Funding
EU Horizon 2020;
Keywords
Checkpoint-restart; Fault-tolerance; DMTCP; Infiniband; Multilevel checkpointing; MPI oversubscribing;
DOI
10.1016/j.parco.2019.02.006
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202 ;
Abstract
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to a dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 204-219
Number of pages: 16
Related papers
40 in total
[1]   Transparent High-Speed Network Checkpoint/Restart in MPI [J].
Adam, Julien ;
Besnard, Jean-Baptiste ;
Malony, Allen D. ;
Shende, Sameer ;
Perache, Marc ;
Carribault, Patrick ;
Jaeger, Julien .
EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018,
[2]  
[Anonymous], 2012, EUR MPI US GROUP M
[3]  
[Anonymous], 2005, USENIX ANN TECHN C F
[4]  
Ansel J, 2009, INT PARALL DISTRIB P, P895
[5]  
Barrett B. W., 2012, SAND201812790 SAND N
[6]  
Bautista-Gomez L., P 2011 INT C HIGH PE, P1, DOI 10.1145/2063384.2063427
[7]   Introducing Task-Containers as an Alternative to Runtime-Stacking [J].
Besnard, Jean-Baptiste ;
Adam, Julien ;
Shende, Sameer ;
Perache, Marc ;
Carribault, Patrick ;
Jaeger, Julien .
PROCEEDINGS OF THE 23RD EUROPEAN MPI USERS' GROUP MEETING (EUROMPI 2016), 2016, :51-63
[8]   Post-failure recovery of MPI communication capability: Design and rationale [J].
Bland, Wesley ;
Bouteiller, Aurelien ;
Herault, Thomas ;
Bosilca, George ;
Dongarra, Jack .
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2013, 27 (03) :244-254
[9]  
Bouteiller A., 2015, EUROMPI 15, DOI 10.1145/2802658.2802668
[10]   Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols [J].
Buntinas, Darius ;
Coti, Camille ;
Herault, Thomas ;
Lemarinier, Pierre ;
Pilard, Laurence ;
Rezmerita, Ala ;
Rodriguez, Eric ;
Cappello, Franck .
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2008, 24 (01) :73-84