Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

Cited by: 0
Authors
Losada, Nuria [1]
Martin, Maria J. [1]
Rodriguez, Gabriel [1]
Gonzalez, Patricia [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, La Coruna, Spain
Keywords
parallel programming; OpenMP; fault tolerance; checkpointing; rollback-recovery; CPPC; MPI
DOI
Not available
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks, showing that most of the applications incur small checkpoint overheads.
Pages: 1352-1372
Page count: 21
Related papers
20 in total
[1] Ahn S, 2003, LECT NOTES COMPUT SC, V2840, P302
[2] [Anonymous], P 2010 IEEE C EV COM
[3] Ansel J., 2009, P 23 IEEE INT PAR DI
[4] BEGUELIN A, 1994, CMUCS94153
[5] Bouteiller Aurelien., 2003, Supercomputing Conference, P25, DOI 10.1145/1048935.1050176
[6] Application-level checkpointing for shared memory programs [J]. Bronevetsky, G; Marques, D; Pingali, K; Szwed, P; Schulz, M. ACM SIGPLAN NOTICES, 2004, 39 (11): 235-247
[7] Bronevetsky G., 2006, Proceedings of the 20th Annual International Conference on Supercomputing, P2, DOI 10.1145/1183401.1183405
[8] Chen Yuqun., 1997, SC C, P33
[9] Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes [J]. Cores, Ivan; Rodriguez, Gabriel; Martin, Maria J.; Gonzalez, Patricia; Osorio, Roberto R. NEW GENERATION COMPUTING, 2013, 31 (03): 163-185
[10] A user-level checkpointing library for POSIX threads programs [J]. Dieter, WR; Lumpp, JE. TWENTY-NINTH ANNUAL INTERNATIONAL SYMPOSIUM ON FAULT-TOLERANT COMPUTING, DIGEST OF PAPERS, 1999: 224-227