model-based fault tolerance;
MPI;
cluster computing;
fault detection;
group communication;
DOI: 10.1023/B:CLUS.0000039491.64560.8a
Abstract:
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing. This approach has also been extended to message-passing systems through user-transparent process checkpointing and message logging. Studies of several types of rollback and recovery have also been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot exploit application-level information.
Affiliation:
Smart Microgrid Research Center, Najafabad Branch, Islamic Azad University, Najafabad; Department of Electrical Engineering, Najafabad Branch, Islamic Azad University, Najafabad
Kargar S.M.
International Journal of Modelling, Identification and Control, 2021, 37(3-4): 354-365