Surviving Failures in Bandwidth-Constrained Datacenters

Cited by: 110

Authors:

Bodik, Peter

Menache, Ishai

Chowdhury, Mosharaf [1]

Mani, Pradeepkumar [2]

Maltz, David A. [2]

Stoica, Ion [1]

Affiliations:

[1] Univ Calif Berkeley, Berkeley, CA USA

[2] Microsoft Corp, Redmond, WA 98052 USA

Source:

ACM SIGCOMM COMPUTER COMMUNICATION REVIEW | 2012 / Vol. 42 / No. 4

Keywords:

datacenter networks; fault tolerance; bandwidth;

DOI:

10.1145/2377677.2377760

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core; spreading servers across fault domains improves fault tolerance, but requires additional bandwidth, while deploying servers together reduces bandwidth usage, but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that both achieves high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.


Pages: 431-442

Number of pages: 12
