Epidemic failure detection and consensus for extreme parallelism

Cited by: 4
Authors
Katti, Amogh [1]
Di Fatta, Giuseppe [1]
Naughton, Thomas [2]
Engelmann, Christian [2]
Affiliations
[1] Univ Reading, Dept Comp Sci, Reading RG6 6AY, Berks, England
[2] Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN USA
Keywords
Fault-tolerant MPI; user-level failure mitigation; failure detection; consensus; Gossip protocols; distributed systems; fault-tolerance
DOI
10.1177/1094342017690910
CLC number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
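The general idea in the abstract above (random pinging to detect failures, then gossip to spread the failure list) can be illustrated with a toy simulation. This is a minimal sketch and not the paper's algorithms: the function name `simulate_gossip`, its parameters, and the cycle model are all assumptions made for illustration. Agreement is also checked globally by the simulator here, whereas the paper detects consensus in a distributed way using techniques this record does not describe.

```python
import random


def simulate_gossip(n, failed, seed=0, max_cycles=10_000):
    """Toy model (hypothetical, not the paper's method).

    Each cycle, every alive process pings one random process.
    If the target has failed, the pinger records it as failed.
    If the target is alive, the pinger pushes its failure list to it.
    Returns (cycles, known), where known maps each alive process
    to the set of failures it has learned.
    """
    rng = random.Random(seed)
    failed = frozenset(failed)
    alive = [p for p in range(n) if p not in failed]
    # Local failure list of every alive process, initially empty.
    known = {p: set() for p in alive}

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            target = rng.randrange(n)
            if target == p:
                continue  # a process does not ping itself
            if target in failed:
                # No reply: the pinger detects the failure.
                known[p].add(target)
            else:
                # Reply received: piggyback the failure list (push gossip).
                known[target] |= known[p]
        # Global check done by the simulator, for illustration only.
        if all(known[p] == failed for p in alive):
            return cycle, known
    raise RuntimeError("no agreement within max_cycles")


if __name__ == "__main__":
    for size in (64, 256, 1024):
        cycles, _ = simulate_gossip(size, failed={3, 17}, seed=1)
        print(size, cycles)
```

Running the script prints the number of cycles needed for a few system sizes, which gives an informal feel for how the cycle count grows with the number of processes. It makes no claim about the figures reported in the paper.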
Pages: 729-743
Page count: 15