Communication-induced determination of consistent snapshots

Cited by: 28
Authors
Hélary, JM [1]
Mostefaoui, A [1]
Raynal, M [1]
Affiliations
[1] Inst Rech Informat & Syst Aleatoires, F-35042 Rennes, France
Keywords
asynchronous distributed computation; checkpointing; communication-induced protocol; consistency; global checkpoint; message recording; snapshot
DOI
10.1109/71.798312
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A classical way to determine consistent snapshots is to use the Chandy-Lamport algorithm. This algorithm relies on specific control messages that allow processes to synchronize local checkpoint determination and message recording so that the resulting snapshot is consistent. This paper investigates a communication-induced approach to determining consistent snapshots. In such an approach, control information is carried by application messages. Two abstract necessary and sufficient conditions are stated: one associated with global checkpoint consistency, the other with message recording. A general protocol is derived from these abstract conditions. This general protocol can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. It also shows that there is an intrinsic trade-off between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general protocol is provided.
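As an illustration of what "control information carried by application messages" can look like, here is a minimal Python sketch of a simple index-based rule of the kind used in communication-induced checkpointing. It is not the protocol of this paper: the class names, the single integer index, and the rule "take a forced checkpoint before delivering a message that carries a larger index" are all assumptions chosen for illustration, and message recording is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    index: int  # control information piggybacked on the application message


@dataclass
class Process:
    """Hypothetical process model; names and fields are illustrative only."""
    name: str
    index: int = 0                                    # index of the latest local checkpoint
    state: list = field(default_factory=list)         # toy application state
    checkpoints: list = field(default_factory=list)   # (index, kind, state copy)

    def __post_init__(self):
        self._checkpoint("initial")

    def _checkpoint(self, kind):
        # Save a copy of the current state under the current index.
        self.checkpoints.append((self.index, kind, list(self.state)))

    def basic_checkpoint(self):
        # Checkpoint taken on the process's own initiative.
        self.index += 1
        self._checkpoint("basic")

    def send(self, payload):
        # No separate control message: the index rides on the payload.
        return Message(payload, self.index)

    def receive(self, msg):
        # Assumed rule: a larger piggybacked index forces a checkpoint
        # BEFORE delivery, so the message is not received "in the past"
        # of the sender's checkpoint with that index.
        if msg.index > self.index:
            self.index = msg.index
            self._checkpoint("forced")
        self.state.append(msg.payload)


p, q = Process("p"), Process("q")
p.basic_checkpoint()           # p moves to index 1
q.receive(p.send("hello"))     # q is forced to checkpoint at index 1 first
```

In this sketch the forced checkpoint at `q` is saved before `"hello"` is delivered, so the two checkpoints with index 1 do not record the receipt of a message whose send is not recorded. A stricter rule like this one forces more checkpoints; the trade-off against recorded messages discussed in the abstract is not modelled here.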
Pages: 865-877
Page count: 13
Related papers
25 in total
[1]  
Acharya A., 1994, Proceedings of the Third International Conference on Parallel and Distributed Information Systems (Cat. No.94TH0668-4), P73, DOI 10.1109/PDIS.1994.331730
[2]   A unified framework for the specification and run-time detection of dynamic properties in distributed computations [J].
Babaoglu, O ;
Fromentin, E ;
Raynal, M .
JOURNAL OF SYSTEMS AND SOFTWARE, 1996, 33 (03) :287-298
[3]   An index-based checkpointing algorithm for autonomous distributed systems [J].
Baldoni, R ;
Quaglia, F ;
Fornara, P .
SIXTEENTH SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1997, :27-34
[4]  
BALDONI R, 1997, P IEEE INT S FAULT T, P68
[5]  
Briatico D., 1984, Proceedings of the Fourth Symposium on Reliability in Distributed Software and Database Systems (Cat. No. 84CH2082-6), P207
[6]   DISTRIBUTED SNAPSHOTS - DETERMINING GLOBAL STATES OF DISTRIBUTED SYSTEMS [J].
CHANDY, KM ;
LAMPORT, L .
ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1985, 3 (01) :63-75
[7]  
ELNOZAHY EN, 1996, CMUCS96181
[8]   Detection of strong unstable predicates in distributed programs [J].
Garg, VK ;
Waldecker, B .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1996, 7 (12) :1323-1333
[9]  
GARG VK, 1995, P INT C SYST SCI MAU, V2, P232
[10]  
GOLDBERG AP, 1991, P ACM ONR WORKSH PAR, P144