Fault-tolerant solutions for a MPI compute intensive application

Cited by: 1
Authors
Mourino, J. C. [1]
Martin, M. J. [2]
Gonzalez, P. [2]
Doallo, R. [2]
Affiliations
[1] CESGA Supercomp Ctr Galicia, Avda Vigo S-N,Campus Sur, Santiago De Compostela 15705, Spain
[2] Univ A Coruna, Dept Elect & Syst, Fac Informat, La Coruna 15071, Spain
Source
15TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS | 2007
DOI
10.1109/PDP.2007.44
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that a machine failure does not wipe out all the computation already done. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capability to their applications. This work presents two different approaches to adding fault tolerance to the MPI version of an air quality simulation. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between the two approaches are portability, transparency level and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
Pages: 246 / +
Page count: 3