Simulative performance analysis of gossip failure detection for scalable distributed systems

Cited by: 4
Authors
Mark W. Burns
Alan D. George
Bradley A. Wallace
Affiliation
[1] University of Florida,High
Keywords
Failure Detection; Link Failure; Sandia National Laboratory; Basic Protocol; Network Partition;
DOI
10.1023/A:1019086910915
Abstract
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems comprised of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
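The abstract does not spell out the protocol mechanics, so the following is only a minimal sketch of the general heartbeat-counter style of gossip failure detection, not the authors' protocol or their simulation model. It assumes each node keeps a heartbeat counter per member, pushes its table to one random peer per round, merges incoming tables by taking the larger counter, and suspects any member whose counter has not advanced for `t_cleanup` rounds. The names `GossipNode`, `simulate`, and `t_cleanup` are hypothetical, and the agreement (consensus) step and the hierarchical and piggybacking variants are not modeled.

```python
import random


class GossipNode:
    """One participant in a generic push-gossip failure detector (illustrative)."""

    def __init__(self, node_id, members, t_cleanup):
        self.node_id = node_id
        self.t_cleanup = t_cleanup  # rounds of silence before a member is suspected
        self.alive = True
        # Highest heartbeat counter seen so far for each member.
        self.heartbeat = {m: 0 for m in members}
        # Local round at which each member's counter last increased.
        self.last_update = {m: 0 for m in members}

    def tick(self, now):
        """Advance this node's own heartbeat at the start of a gossip round."""
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now

    def merge(self, remote_heartbeat, now):
        """Merge a received table, keeping the larger counter per member."""
        for member, hb in remote_heartbeat.items():
            if hb > self.heartbeat[member]:
                self.heartbeat[member] = hb
                self.last_update[member] = now

    def suspects(self, now):
        """Members whose counter has not advanced within t_cleanup rounds."""
        return {
            m for m in self.heartbeat
            if m != self.node_id and now - self.last_update[m] > self.t_cleanup
        }


def simulate(n_nodes=8, failed=3, fail_round=5, rounds=40, t_cleanup=10, seed=0):
    """Run push gossip with one permanent fail-silent node failure (assumed scenario)."""
    rng = random.Random(seed)
    ids = list(range(n_nodes))
    nodes = [GossipNode(i, ids, t_cleanup) for i in ids]
    for now in range(1, rounds + 1):
        if now == fail_round:
            nodes[failed].alive = False  # fail-silent: stops sending and receiving
        for node in nodes:
            if not node.alive:
                continue
            node.tick(now)
            peer = rng.choice([i for i in ids if i != node.node_id])
            if nodes[peer].alive:
                nodes[peer].merge(dict(node.heartbeat), now)
    return nodes
```

Calling `simulate()` and then `node.suspects(40)` on each live node shows which members that node currently suspects. Counting rounds until every live node suspects the failed one gives a rough analogue of the agreement time discussed in the abstract, under the simplifying assumptions stated above.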
Pages: 207-217
Page count: 10