Resilient gossip algorithms for collecting online management information in exascale clusters

Cited by: 5
Authors
Barak, Amnon [1]
Drezner, Zvi [2]
Levy, Ely [1]
Lieber, Matthias [3]
Shiloh, Amnon [1]
Affiliations
[1] Hebrew Univ Jerusalem, Dept Comp Sci, IL-91904 Jerusalem, Israel
[2] Calif State Univ Fullerton, Coll Business & Econ, Fullerton, CA 92834 USA
[3] Tech Univ Dresden, Ctr Informat Serv & High Performance Comp, D-01062 Dresden, Germany
Keywords
exascale clusters; gossip algorithms; resource management
DOI
10.1002/cpe.3465
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
Management of forthcoming exascale clusters requires frequent collection of run-time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node. Copyright (c) 2015 John Wiley & Sons, Ltd.
Pages: 4797-4818
Page count: 22
Related papers
22 in total
  • [1] Randomized gossip algorithms for maintaining a distributed bulletin board with guaranteed age properties
    Amar, Lior
    Barak, Amnon
    Drezner, Zvi
    Okun, Michael
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2009, 21 (15) : 1907 - 1927
  • [2] [Anonymous], 1948, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, DOI 10.1119/1.15378
  • [3] Barak A., The MOSIX Cluster Management System for Distributed Computing on Linux Clusters and Multi-Cluster Private Clouds
  • [4] Bhatele A, 2013, P SC 13 DENV CO US, P41
  • [5] Bohm S., 2010, Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC 2010), P72, DOI 10.1109/HPCC.2010.32
  • [6] Chen D., 2011, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p26:1
  • [7] Cuenca-Acuna FM, 2003, 12TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, PROCEEDINGS, P236
  • [8] Geographic gossip: Efficient averaging for sensor networks
    Dimakis, Alexandros D. G.
    Sarwate, Anand D.
    Wainwright, Martin J.
    [J]. IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2008, 56 (03) : 1205 - 1216
  • [9] Frey D., 2010, PEER TO PEER COMPUTI, P1
  • [10] Peer-to-peer membership management for gossip-based protocols
    Ganesh, AJ
    Kermarrec, AM
    Massoulié, L
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2003, 52 (02) : 139 - 149