model-based fault tolerance;
MPI;
cluster computing;
fault detection;
group communication;
DOI: 10.1023/B:CLUS.0000039491.64560.8a
Abstract:
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing. This approach has also been extended to message-passing systems through user-transparent process checkpointing and message logging. Furthermore, studies of several types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot exploit application-level information.