Fast and Message-Efficient Global Snapshot Algorithms for Large-Scale Distributed Systems

被引:19
作者
Kshemkalyani, Ajay D. [1 ]
机构
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
关键词
Distributed system; global state; message passing; distributed snapshot; checkpoint; hypercube; supercomputer; cluster; overlay; RECOVERY; NETWORKS; SEARCH;
D O I
10.1109/TPDS.2010.24
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Large-scale distributed systems such as supercomputers and peer-to-peer systems typically have a fully connected logical topology over a large number of processors. Existing snapshot algorithms in such systems have high response time and/or require a large number of messages, typically O(n(2)), where n is the number of processes. In this paper, we present a suite of two algorithms: simple_tree, and hypercube, that are both fast and require a small number of messages. This makes the algorithms highly scalable. Simple_tree requires O(n) messages and has O(log n) response time. Hypercube requires O(n log n) messages and has O(log n) response time, in addition to having the property that the roles of all the processes are symmetrical. Process symmetry implies greater potential for balanced workload and congestion-freedom. All the algorithms assume non-FIFO channels.
引用
收藏
页码:1281 / 1289
页数:9
相关论文
共 66 条
[1]  
Adler M., 2003, STOC, P575, DOI DOI 10.1145/780542.780626
[2]  
Agarwal S., 2004, Proceedings of the 18th annual international conference on supercomputing, ICS '04, P277, DOI [10.1145/1006209.1006248, DOI 10.1145/1006209.1006248]
[3]  
ANCEAUME E, 2008, P 2 IEEE INT C SELF, P15, DOI DOI 10.1109/SASO.2008.44
[4]  
[Anonymous], 2001, Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems, DOI DOI 10.1007/3-540-45518-3_18
[5]  
[Anonymous], 2007, P 16 INT C WORLD WID
[6]  
Aspnes J, 2003, SIAM PROC S, P384
[7]   COMPLEXITY OF NETWORK SYNCHRONIZATION [J].
AWERBUCH, B .
JOURNAL OF THE ACM, 1985, 32 (04) :804-823
[8]   An index-based checkpointing algorithm for autonomous distributed systems [J].
Baldoni, R ;
Quaglia, F ;
Fornara, P .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1999, 10 (02) :181-192
[9]   Dynamic quorums for DHT-based enterprise infrastructures [J].
Baldoni, Roberto ;
Jimenez-Peris, Ricardo ;
Patino-Martinez, Marta ;
Querzoni, Leonardo ;
Virgillito, Antonino .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2008, 68 (09) :1235-1249
[10]  
BERNDT P, 2008, P 7 INT WORKSH PEER