Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Cited by: 21

Authors:

Cores, Ivan [1]

Rodriguez, Gabriel [1]

Martin, Maria J. [1]

Gonzalez, Patricia [1]

Osorio, Roberto R. [1]

Affiliation:

[1] Univ A Coruna, Comp Architecture Grp, La Coruna, Spain

Source:

NEW GENERATION COMPUTING | 2013 / Vol. 31 / Issue 03

Keywords:

Parallel Programming; Message-Passing; MPI; Fault Tolerance; Checkpointing; FAULT-TOLERANCE; CPPC;

DOI:

10.1007/s00354-013-0302-4

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures so that a machine failure does not cause all completed computation to be lost. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large-scale systems. In this paper, three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.


Pages: 163-185

Page count: 23

43 references in total

[1] Agarwal S., 2004, Proceedings of the 18th annual international conference on supercomputing, ICS '04, P277, DOI 10.1145/1006209.1006248

[2] [Anonymous], NAS PAR BENCHM

[3] [Anonymous], 2011, P 2011 INT C HIGH PE

[4] [Anonymous], 2010, P INT C HIGH PERF CO, DOI 10.1109/SC.2010.18

[5] Bautista Gomez Leonardo Arturo, 2010, Proceedings 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), P63, DOI 10.1109/CCGRID.2010.40

[6] Algorithm-based fault tolerance applied to high performance computing [J]. Bosilca, George; Delmas, Remi; Dongarra, Jack; Langou, Julien. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2009, 69 (04): 410-416

[7] FAULT TOLERANCE IN PETASCALE/EXASCALE SYSTEMS: CURRENT KNOWLEDGE, CHALLENGES AND RESEARCH OPPORTUNITIES [J]. Cappello, Franck. INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2009, 23 (03): 212-226

[8] Chao Wang, 2010, Proceedings 2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS 2010), P524, DOI 10.1109/ICPADS.2010.48

[9] Chen Zizhong, 2005, P 10 ACM SIGPLAN S P, P213, DOI 10.1145/1065944.1065973

[10] A New Diskless Checkpointing Approach for Multiple Processor Failures [J]. Chiu, Ge-Ming; Chiu, Jane-Ferng. IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2011, 8 (04): 481-493
