Surviving Failures in Bandwidth-Constrained Datacenters

Cited by: 110
Authors
Bodik, Peter
Menache, Ishai
Chowdhury, Mosharaf [1]
Mani, Pradeepkumar [2]
Maltz, David A. [2]
Stoica, Ion [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
datacenter networks; fault tolerance; bandwidth
DOI
10.1145/2377677.2377760
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core: spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that analysis, we propose and evaluate a novel optimization framework that both achieves high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
Pages: 431-442
Page count: 12