MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by: 21

Authors:

Rajanikanth Batchu

Yoginder S. Dandass

Anthony Skjellum

Murali Beddhu

Source:

Cluster Computing | 2004 / Vol. 7 / No. 4

Keywords:

model-based fault tolerance; MPI; cluster computing; fault detection; group communication

DOI:

10.1023/B:CLUS.0000039491.64560.8a

Abstract:

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing. This approach has also been extended to message-passing systems through user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot exploit application-level information.

Citation

Pages: 303-315

Page count: 12
