C3: A system for automating application-level checkpointing of MPI programs

Cited by: 2

Authors:

Bronevetsky, G [1]

Marques, D [1]

Pingali, K [1]

Stodghill, P [1]

Affiliations:

[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA

Source:

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING | 2004 / Vol. 2958

Keywords:

Application level - Check pointing - Checkpointing techniques - Coordination protocols - Equivalent faults - High-performance platforms - MPI applications - Program variables;

DOI:

10.1007/978-3-540-24644-2_23

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Fault tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after a failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In ([1],[2]) we have presented a distributed checkpoint coordination protocol which handles MPI's point-to-point and collective constructs, while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. This thin layer is used by C3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results that show that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.
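Illustrative note (not part of the original record): the manual application-level checkpointing that the abstract describes, saving key program variables at chosen points and restoring them on restart, can be sketched for a single process as follows. This is a minimal sketch under stated assumptions, not the C3 system and not its MPI coordination protocol. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a crash are all hypothetical choices made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Save the key program variables (hypothetical helper, not a C3 API).

    The state is written to a temporary file and then renamed, so a crash
    during the write cannot corrupt the last good checkpoint.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomically replaces any older checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    `fail_at` is an assumption of this sketch: it simulates a failure
    just before the given step is executed.
    """
    # (ii) Restore the computational state if a checkpoint is available.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # (i) Save the key variables at a chosen point in the program.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved step rather than from the beginning. In an MPI program each process would also have to agree on a consistent set of checkpoints and account for in-flight messages, which is the problem the coordination protocol in the abstract addresses and which this sketch does not attempt.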

Citation

Pages: 357-373

Page count: 17