High Performance Computing Systems with Various Checkpointing Schemes

Cited by: 9

Authors:

Naksinehaboon, N. [1]

Paun, M. [2, 3]

Nassar, R. [2]

Leangsuksun, B. [1]

Scott, S. [4]

Affiliations:

[1] Louisiana Tech Univ, Dept Comp Sci, Ruston, LA 71272 USA

[2] Louisiana Tech Univ, Dept Math & Stat, Ruston, LA 71272 USA

[3] Spiru Haret Univ, Finance & Banks Fac, Bucharest, Romania

[4] Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN 37831 USA

Source:

INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL | 2009 / Vol. 4 / No. 4

Funding:

National Science Foundation (USA);

Keywords:

Large-scale distributed system; reliability; fault-tolerance; checkpoint/restart model; HPC; INTERVAL;

DOI:

10.15837/ijccc.2009.4.2455

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract:

Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To deal with failures, a fault-tolerance mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models, one for the full checkpoint scheme and one for the incremental checkpoint scheme. The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models balance the checkpoint overhead against the re-computing time. Because each incremental checkpoint adds extra cost during the recovery period, a method is given for finding the number of incremental checkpoints between two consecutive full checkpoints. Our simulation suggests that in most cases the incremental checkpoint model reduces the waste time more than the full checkpoint model does. The waste times produced by both models range from 2% to 28% of the application completion time, depending on the checkpoint overheads.
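The trade-off named in the abstract, checkpoint overhead versus re-computing time, can be illustrated with a minimal sketch. This is not the paper's reliability-based model; it assumes Young's classical first-order approximation for a fixed checkpoint interval, with exponentially distributed failures and restart cost ignored. The function names and the numeric values in the usage note are hypothetical.

```python
import math


def young_interval(checkpoint_overhead: float, mtbf: float) -> float:
    """Young's first-order checkpoint interval, sqrt(2 * C * M).

    checkpoint_overhead (C) and mtbf (M) must use the same time unit.
    Assumption: this stands in for the paper's model purely as an illustration.
    """
    if checkpoint_overhead <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_overhead and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_overhead * mtbf)


def waste_fraction(interval: float, checkpoint_overhead: float, mtbf: float) -> float:
    """First-order fraction of run time lost to checkpointing and rework."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Share of time spent writing checkpoints.
    overhead_term = checkpoint_overhead / interval
    # Expected rework: about half an interval is recomputed per failure.
    rework_term = interval / (2.0 * mtbf)
    return overhead_term + rework_term


if __name__ == "__main__":
    C = 60.0      # hypothetical checkpoint overhead, seconds
    M = 86400.0   # hypothetical mean time between failures, seconds
    tau = young_interval(C, M)
    print(f"interval = {tau:.0f} s, waste = {waste_fraction(tau, C, M):.1%}")
```

With these hypothetical inputs the interval comes out at roughly 3220 s and the first-order waste at about 3.7% of run time. Shorter intervals raise the overhead term and longer ones raise the rework term, which is the balance the abstract describes.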

Citation

Pages: 386-400

Page count: 15

