MATCH: An MPI Fault Tolerance Benchmark Suite

Cited by: 5
Authors
Guo, Luanzheng [1 ]
Georgakoudis, Giorgis [2 ]
Parasyris, Konstantinos [2 ]
Laguna, Ignacio [2 ]
Li, Dong [1 ]
Affiliations
[1] Univ Calif Merced, EECS, Merced, CA 95343 USA
[2] Lawrence Livermore Natl Lab, CASC, Livermore, CA 94550 USA
Source
2020 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2020) | 2020
Funding
U.S. National Science Foundation
Keywords
FRAMEWORK
DOI
10.1109/IISWC50251.2020.00015
Chinese Library Classification (CLC): TP39 [Computer applications]
Discipline codes: 081203; 0835
Abstract
MPI has been ubiquitously deployed in flagship HPC systems to accelerate distributed scientific applications running on thousands of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Consequently, a collection of MPI fault tolerance techniques has been proposed to allow MPI applications to resume execution efficiently after system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, and thus to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To address this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, study, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation yields useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scale and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design.
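To make the compared designs concrete, the following is a minimal sketch (not taken from the paper) of application-level checkpoint/restart with the FTI library, the checkpointing component named in finding (3). The configuration file name, array size, and checkpoint interval are hypothetical, and exact FTI signatures should be verified against the FTI documentation.

#include <mpi.h>
#include <fti.h>

#define N     1024
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* FTI is initialized after MPI; "config.fti" is a hypothetical
       configuration file naming checkpoint levels and intervals. */
    FTI_Init("config.fti", MPI_COMM_WORLD);

    static double field[N];
    int iter = 0;

    /* Register the state that must survive a process failure. */
    FTI_Protect(0, &iter, 1, FTI_INTG);
    FTI_Protect(1, field, N, FTI_DBLE);

    /* On a restart, FTI_Status() is nonzero and the protected
       variables can be restored from the last checkpoint. */
    if (FTI_Status() != 0) {
        FTI_Recover();
    }

    for (; iter < ITERS; iter++) {
        /* ... application work on `field`, using FTI_COMM_WORLD
           in place of MPI_COMM_WORLD for communication ... */

        if (iter % 10 == 0) {
            /* Level-1 (node-local) checkpoint every 10 iterations. */
            FTI_Checkpoint(iter / 10 + 1, 1);
        }
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}

By contrast, ULFM recovery works at the MPI level (revoking and shrinking a communicator after a process failure), while Reinit restarts the MPI state and re-enters the application's main loop from a checkpoint; these are the recovery mechanisms whose combinations MATCH exercises.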
Pages: 60-71
Number of pages: 12