Fundamentals of fault-tolerant distributed computing in asynchronous environments

Cited by: 130
Authors
Gärtner, FC [1]
Affiliations
[1] Tech Univ Darmstadt, Dept Comp Sci, D-64283 Darmstadt, Germany
Keywords
asynchronous system; agreement problem; consensus problem; failure correction; failure detection; fault models; fault tolerance; liveness; message passing; possibility detection; predicate detection; redundancy; safety
DOI
10.1145/311531.311532
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Pages: 1-26
Page count: 26
Related papers
97 in total
[1] Aguilera MK, 1998, LECT NOTES COMPUT SC, V1499, P231, DOI 10.1007/BFb0056486
[2] Aguilera MK, 1997, LECT NOTES COMPUT SC, V1320, P126, DOI 10.1007/BFb0030680
[3] Aguilera MK, 1997, TR971640 CORN U DEP
[4] Aguilera MK, 1997, TR971632 CORN U DEP
[5] Almeida C, 1998, RT9804 CTI I SUP TEC
[6] Almeida C, 1998, P 19 IEEE S REALT SY
[7] Alpern B, Schneider FB. Defining liveness [J]. Information Processing Letters, 1985, 21(04): 181-185
[8] [Anonymous], TR941425 CORN U DEP
[9] Arora A, Gouda M. Distributed reset [J]. IEEE Transactions on Computers, 1994, 43(09): 1026-1038
[10] Arora A, 1996, J HIGH SPEED NETW, V5, P293