Automated application-level checkpointing of MPI programs

Cited by: 45
Authors
Bronevetsky, G [1]
Marques, D [1]
Pingali, K [1]
Stodghill, P [1]
Affiliation
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
algorithms; reliability; languages; theory; fault-tolerance; application-level checkpointing; MPI; scientific computing; non-FIFO communication
DOI
10.1145/966049.781513
Chinese Library Classification number
TP31 [Computer Software]
Subject classification code
081202; 0835
Abstract
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
Pages: 84-94
Page count: 11