The design and implementation of a fault-tolerant RPC system: Ninf-C

Cited by: 7
Authors
Nakada, H [1]
Tanaka, Y [1]
Matsuoka, S [1]
Sekiguchi, S [1]
Affiliations
[1] Natl Inst Adv Ind Sci & Technol, AIST, Tsukuba, Ibaraki 3058568, Japan
Source
SEVENTH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND GRID IN ASIA PACIFIC REGION, PROCEEDINGS | 2004
DOI
10.1109/HPCASIA.2004.1324011
Chinese Library Classification number
TP301 [Theory, methods];
Subject classification code
081202;
Abstract
We describe the design and implementation of a fault-tolerant GridRPC system, Ninf-C, designed for easy programming of large-scale master-worker programs that take from a few days to a few months to execute in a Grid environment. Ninf-C employs Condor, developed at the University of Wisconsin, as the underlying middleware supporting remote file transmission and checkpointing for system-wide robustness for application users on the Grid. Ninf-C layers all the GridRPC communication and task-parallel programming features on top of Condor in a non-trivial fashion, assuming that the entire program is structured in a master-worker style; in fact, older Ninf master-worker programs can be run directly or trivially ported to Ninf-C. In contrast to the original Ninf, Ninf-C exploits and extends Condor features extensively for robustness and transparency, such as 1) checkpointing and stateful recovery of the master process, 2) the master and workers mutually communicating using (remote) files, not IP sockets, and 3) automated throttling of parallel GridRPC calls; and in contrast to using Condor directly, programmers can set up complex dynamic workflows as well as master-worker parallel structures with almost no learning curve involved. To prove the robustness of the system, we performed an experiment on a heterogeneous cluster that consists of x86 and SPARC CPUs, and ran a simple but long-running master-worker program with staged rebooting of multiple nodes to simulate some serious fault situations. The program execution finished normally, avoiding all the fault scenarios, demonstrating the robustness of Ninf-C.
Pages: 9-18
Page count: 10