Compiler-Assisted Application-Level Checkpointing for MPI Programs

Cited by: 3
Authors
Yang, Xuejun [1 ]
Wang, Panfeng [1 ]
Fu, Hongyi [1 ]
Du, Yunfei [1 ]
Wang, Zhiyuan [1 ]
Jia, Jia [1 ]
Affiliation
[1] Natl Univ Def Technol, Natl Lab Parallel & Distributed Proc, Coll Comp, Changsha, Hunan, Peoples R China
Source
28TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, VOLS 1 AND 2, PROCEEDINGS | 2008
DOI
10.1109/ICDCS.2008.25
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However, this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we first propose a live-variable analysis method for MPI programs. We then use this analysis to optimize which data are saved in application-level checkpointing. On this theoretical foundation, we implement a source-to-source pre-compiler (ALEC) that automates application-level checkpointing. Finally, on a 512-CPU cluster system, we evaluate the performance of five FORTRAN/MPI programs that ALEC has transformed to include checkpointing. The experimental results show that (i) application-level checkpointing based on live-variable analysis for MPI programs can efficiently reduce the amount of checkpoint data, thereby decreasing the overhead of checkpoint and restart; and (ii) ALEC is capable of automating application-level checkpointing correctly and effectively.
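The idea of choosing checkpoint data by live-variable analysis can be illustrated with a minimal sketch. This is not the ALEC algorithm and it ignores MPI communication entirely; it is the textbook backward dataflow analysis applied to a hypothetical single-process loop. The block names, variable names (`grid`, `n`, `step`, `tmp`), and the `live_variables` helper are all assumptions made for illustration.

```python
# Toy illustration only: classic backward live-variable analysis on a small
# control-flow graph. All names here are hypothetical; the MPI-specific
# analysis described in the abstract is NOT modeled.

def live_variables(blocks, successors):
    """blocks: name -> (use set, def set), where `use` holds variables read
    before being written in the block. successors: name -> list of names.
    Returns (live_in, live_out) dictionaries of sets."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b, (use, define) in blocks.items():
            out = set()
            for s in successors.get(b, []):
                out |= live_in[s]
            new_in = use | (out - define)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# Hypothetical program shape:
#   init: n, grid, tmp, step are initialized
#   ckpt: checkpoint location at the loop head
#   body: tmp = f(grid); grid = g(grid, tmp); step = step + 1
#   test: if step < n goto ckpt else goto done
#   done: output grid
blocks = {
    "init": (set(), {"n", "grid", "tmp", "step"}),
    "ckpt": (set(), set()),
    "body": ({"grid", "step"}, {"tmp", "grid", "step"}),
    "test": ({"step", "n"}, set()),
    "done": ({"grid"}, set()),
}
successors = {
    "init": ["ckpt"],
    "ckpt": ["body"],
    "body": ["test"],
    "test": ["ckpt", "done"],
    "done": [],
}

live_in, live_out = live_variables(blocks, successors)

# Only variables live at the checkpoint location need to be saved.
to_save = live_in["ckpt"]
print(sorted(to_save))  # ['grid', 'n', 'step']
```

In this toy case `tmp` is always overwritten before it is read after the checkpoint, so it is dead at the checkpoint location and can be left out of the saved state.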
Pages: 251-259
Page count: 9