Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications

Cited by: 522
Authors
Gill, Phillipa [1]
Jain, Navendu [1]
Nagappan, Nachiappan [1]
Affiliations
[1] Univ Toronto, Toronto, ON M5S 1A1, Canada
Keywords
Network Management; Performance; Reliability; Data Centers; Network Reliability
DOI
10.1145/2043164.2018477
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices and links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as top-of-rack switches (ToRs) and aggregation switches (AggS) are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
Pages: 350-361
Page count: 12