Surviving Failures in Bandwidth-Constrained Datacenters

Cited by: 108
Authors
Bodik, Peter
Menache, Ishai
Chowdhury, Mosharaf [1]
Mani, Pradeepkumar [2]
Maltz, David A. [2]
Stoica, Ion [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
datacenter networks; fault tolerance; bandwidth
DOI
10.1145/2377677.2377760
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core: spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that analysis, we propose and evaluate a novel optimization framework that both achieves high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
Pages: 431-442
Page count: 12