MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by: 21
Authors
Rajanikanth Batchu
Yoginder S. Dandass
Anthony Skjellum
Murali Beddhu
Keywords
model-based fault tolerance; MPI; cluster computing; fault detection; group communication
DOI
10.1023/B:CLUS.0000039491.64560.8a
Abstract
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot exploit application-level information.
Pages: 303-315
Page count: 12