Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

Times cited: 0
Authors
Losada, Nuria [1]
Martin, Maria J. [1]
Rodriguez, Gabriel [1]
Gonzalez, Patricia [1]
Affiliation
[1] University of A Coruña, Computer Architecture Group, A Coruña, Spain
Keywords
parallel programming; OpenMP; fault tolerance; checkpointing; rollback-recovery; CPPC; MPI
DOI
Not available
Chinese Library Classification
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Despite the increasing popularity of shared-memory systems, there is a lack of tools that provide fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on inserting fault tolerance into long-running MPI applications. This paper presents an extension to CPPC that allows the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks, showing that most of the applications present small checkpointing overheads.
Pages: 1352-1372
Number of pages: 21