MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by: 21
Authors
Rajanikanth Batchu
Yoginder S. Dandass
Anthony Skjellum
Murali Beddhu
Keywords
model-based fault tolerance; MPI; cluster computing; fault detection; group communication
DOI
10.1023/B:CLUS.0000039491.64560.8a
Abstract
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This approach has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot use application-level information.
Pages: 303-315
Page count: 12