Dynamic Fault Tolerance in Fat Trees

Cited by: 22
Authors
Sem-Jacobsen, Frank Olaf [1]
Skeie, Tor [1, 2]
Lysne, Olav [1, 2]
Duato, Jose [1, 3]
Affiliations
[1] Simula Res Lab, N-1325 Lysaker, Norway
[2] Univ Oslo, N-0316 Oslo, Norway
[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
Fat trees; k-ary n-trees; dynamic fault tolerance; deterministic routing; adaptive routing
DOI
10.1109/TC.2010.97
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k - 1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k - 1 limit with high probability.
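The guarantee stated in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the node and switch counts assume the standard k-ary n-tree construction (k^n processing nodes, n * k^(n-1) switches, each switch with 2k ports). The only quantity drawn from the abstract is the bound of k - 1 guaranteed tolerable faults, where k is half the switch port count.

```python
def guaranteed_fault_tolerance(ports_per_switch: int) -> int:
    """Return k - 1, the number of simultaneous benign faults the abstract
    says are guaranteed tolerable, where k is half the switch port count."""
    if ports_per_switch < 2 or ports_per_switch % 2 != 0:
        raise ValueError("expected an even port count of at least 2")
    k = ports_per_switch // 2
    return k - 1


def kary_ntree_size(k: int, n: int) -> tuple[int, int]:
    """Return (processing nodes, switches) for a k-ary n-tree.

    Assumption (standard definition, not stated in the abstract):
    k**n processing nodes and n * k**(n - 1) switches.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return k ** n, n * k ** (n - 1)


if __name__ == "__main__":
    ports = 8                      # hypothetical 8-port switches, so k = 4
    k = ports // 2
    nodes, switches = kary_ntree_size(k, 3)
    print(f"k = {k}: {nodes} nodes, {switches} switches, "
          f"guaranteed tolerance of {guaranteed_fault_tolerance(ports)} faults")
```

With 8-port switches, for example, k is 4 and the guarantee covers up to 3 simultaneous switch or link faults. The abstract reports that more faults can often be handled in practice, which this bound does not capture.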
Pages: 508-525
Number of pages: 18