A Flexible Checkpoint/Restart Model in Distributed Systems

Cited by: 0

Authors:

Bouguerra, Mohamed-Slim ^{[1,2]}

Gautier, Thierry ^{[2]}

Trystram, Denis ^{[1]}

Vincent, Jean-Marc ^{[1]}

Affiliations:

[1] Grenoble Univ, ZIRST 51, Ave Jean Kuntzmann, F-38330 Montbonnot St Martin, St Martin, France

[2] INRIA Rhone Alpes, F-38334 Saint Ismier, France

Source:

PARALLEL PROCESSING AND APPLIED MATHEMATICS, PT I | 2010 / Vol. 6067

Keywords:

Fault tolerance; Reliability modeling; Checkpointing; INTERVAL; FAILURES; SCALE;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Large-scale applications running on new computing platforms with thousands of processors face reliability problems: the failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable execution are based on fault tolerance mechanisms, and coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through a mathematical analysis of reliability, we apply this new model to compute the optimal interval between checkpoints in order to minimize the average completion time. The model's independence from the type of failure law makes it completely flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% in some cases and the total overhead by up to a factor of 4 in some cases. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.

References

Pages: 206 / +

Page count: 2

14 entries in total

[1]

ADIGA NR, 2002, ACM IEEE C, P60

[2]

BOUGUERRA MS, 2008, RR6751 INRIA

[3] DISTRIBUTED SNAPSHOTS - DETERMINING GLOBAL STATES OF DISTRIBUTED SYSTEMS [J].

CHANDY, KM ;

LAMPORT, L .

ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1985, 3 (01) :63-75

[4] A higher order estimate of the optimum checkpoint interval for restart dumps [J].

Daly, JT .

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING THEORY METHODS AND APPLICATIONS, 2006, 22 (03) :303-312

[5] Checkpointing for Peta-scale systems: A look into the future of practical rollback-recovery [J].

Elnozahy, EN ;

Plank, JS .

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2004, 1 (02) :97-108

[6] SELECTION OF A CHECKPOINT INTERVAL IN A CRITICAL-TASK ENVIRONMENT [J].

GEIST, R ;

REYNOLDS, R ;

WESTALL, J .

IEEE TRANSACTIONS ON RELIABILITY, 1988, 37 (04) :395-400

[7] An analysis of clustered failures on large supercomputing systems [J].

Hacker, Thomas J. ;

Romero, Fabian ;

Carothers, Christopher D. .

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2009, 69 (07) :652-665

[8] An optimal checkpoint/restart model for a large scale High Performance Computing system [J].

Liu, Yudan ;

Nassar, Raja ;

Leangsuksun, Chokchai ;

Naksinehaboon, Nichanion ;

Paun, Mihaela ;

Scott, Stephen L. .

2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, :1491-+

[9]

Naksinehaboon N, 2008, IEEE ACM INT SYMP, P783, DOI 10.1109/CCGRID.2008.109

[10]

Oliner A.J., 2006, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, P14, DOI [10.1145/1183401.1183406, DOI 10.1145/1183401.1183406]
