Fault tolerance in Message Passing Interface programs

Cited by: 55
Authors
Gropp, W [1]
Lusk, E [1]
Affiliations
[1] Argonne Natl Lab, Math & Comp Sci Div, Argonne, IL 60439 USA
Keywords
MPI; fault tolerance; process management; parallel computing
DOI
10.1177/1094342004046045
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Pages: 363-372
Number of pages: 10