MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by: 21

Authors:

Rajanikanth Batchu

Yoginder S. Dandass

Anthony Skjellum

Murali Beddhu

Source:

Cluster Computing | 2004 / Vol. 7 / Issue 4

Keywords:

model-based fault tolerance; MPI; cluster computing; fault detection; group communication

DOI:

10.1023/B:CLUS.0000039491.64560.8a

Abstract:

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because of their inability to utilize application-level information.

Pages: 303-315

Page count: 12