On implementing omega in systems with weak reliability and synchrony assumptions

被引:33
作者
Aguilera, Marcos K. [1 ]
Delporte-Gallet, Carole [2 ]
Fauconnier, Hugues [2 ]
Toueg, Sam [3 ]
机构
[1] HP Labs, Palo Alto, CA 94304 USA
[2] Univ Paris 07, LIAFA, F-75205 Paris 13, France
[3] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 3G4, Canada
基金
加拿大自然科学与工程研究理事会;
关键词
D O I
10.1007/s00446-008-0068-y
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We study the feasibility and cost of implementing Omega-a fundamental failure detector at the core of many algorithms-in systems with weak reliability and synchrony assumptions. Intuitively, Omega allows processes to eventually elect a common leader. We first give an algorithm that implements Omega in a weak system S where (a) except for some unknown timely process s, all processes may be arbitrarily slow or may crash, and (b) only the output links of s are eventually timely (all other links can be arbitrarily slow and lossy). Previously known algorithms for Omega worked only in systems that are strictly stronger than S in terms of reliability or synchrony assumptions.We next show that algorithms that implement Omega in system S are necessarily expensive in terms of communication complexity: all correct processes (except possibly one) must send messages forever; moreover, a quad-ratic number of links must carry messages forever. This result holds even for algorithms that tolerate at most one crash. Finally, we show that with a small additional assumption to system S-the existence of some unknown correct process whose links can be arbitrarily slow and lossy but fair-there is a communication-efficient algorithm for Omega such that eventually only one process (the elected leader) sends messages. Some recent experimental results indicate that two of the algorithms for Omega described in this paper can be used in dynamically-changing systems and work well in practice [Schiper, Toueg in Proceedings of the 38th International Conference on Dependable Systems and Networks, pp. 207-216 (2008)].
引用
收藏
页码:285 / 314
页数:30
相关论文
共 36 条
[1]  
Aguilera M.K., 2004, Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing, P328
[2]  
Aguilera M.K., 2003, Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing, P306, DOI [10.1145/872035.872081, DOI 10.1145/872035.872081]
[3]  
Aguilera MK, 2001, P 15 INT C DISTRIBUT, P108
[4]  
AGUILERA MK, TYPE FAIRNESS UNPUB
[5]   Implementation and performance evaluation of an adaptable failure detector [J].
Bertier, M ;
Marin, O ;
Sens, P .
INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2002, :354-363
[6]   Practical byzantine fault tolerance and proactive recovery [J].
Castro, M ;
Liskov, B .
ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2002, 20 (04) :398-461
[7]   Unreliable failure detectors for reliable distributed systems [J].
Chandra, TD ;
Toueg, S .
JOURNAL OF THE ACM, 1996, 43 (02) :225-267
[8]   The weakest failure detector for solving Consensus [J].
Chandra, TD ;
Hadzilacos, V ;
Toueg, S .
JOURNAL OF THE ACM, 1996, 43 (04) :685-722
[9]  
Chandra T, 2007, PODC'07: PROCEEDINGS OF THE 26TH ANNUAL ACM SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING, P398
[10]   On the quality of service of failure detectors [J].
Chen, W ;
Toueg, S ;
Aguilera, MK .
IEEE TRANSACTIONS ON COMPUTERS, 2002, 51 (05) :561-580