Scalable Near Real-Time Failure Localization of Data Center Networks

Cited by: 27
Authors
Herodotou, Herodotos [1 ]
Ding, Bolin [1 ]
Balakrishnan, Shobana [1 ]
Outhred, Geoff [2 ]
Fitter, Percy [2 ]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Microsoft, Redmond, WA USA
Source
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14) | 2014
Keywords
Failure localization; Data center networks;
DOI
10.1145/2623330.2623365
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large-scale data center networks are complex, comprising several thousand network devices and several hundred thousand links, and form the critical infrastructure upon which all higher-level services depend. Despite the built-in redundancy in data center networks, performance issues and device or link failures in the network can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in the network in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent in detecting faults but limited in its ability to isolate the exact fault location depending on its scale and granularity. Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in the failure probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that is used by network operators to investigate current issues and identify probable root causes.
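Illustrative sketch (not from the paper): the abstract's idea of computing a per-device failure probability from active probes and flagging statistically significant increases can be pictured roughly as follows. Everything here is an assumption made for illustration: the probe format `(path, ok)`, the device names, the function names, the threshold, and the choice of a one-sided two-proportion z-test as a stand-in for whatever statistical technique the authors actually use.

```python
import math
from collections import defaultdict

def failure_counts(probes):
    """Count (failed, sent) probes per device.

    probes: iterable of (path, ok), where path is a tuple of device names the
    probe traversed and ok is True if the probe succeeded (hypothetical format).
    """
    sent = defaultdict(int)
    failed = defaultdict(int)
    for path, ok in probes:
        for device in path:
            sent[device] += 1
            if not ok:
                failed[device] += 1
    return {d: (failed[d], sent[d]) for d in sent}

def z_increase(base_failed, base_sent, cur_failed, cur_sent):
    """Two-proportion z statistic; large positive values mean the current
    failure rate is higher than the baseline rate."""
    pooled = (base_failed + cur_failed) / (base_sent + cur_sent)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_sent + 1 / cur_sent))
    if se == 0:
        return 0.0  # no failures (or only failures) in both windows
    return (cur_failed / cur_sent - base_failed / base_sent) / se

def rank_suspects(baseline, current, z_threshold=2.33):
    """Return devices whose failure rate rose significantly, highest z first.
    The threshold 2.33 (about the 1% one-sided level) is an arbitrary choice."""
    base = failure_counts(baseline)
    cur = failure_counts(current)
    suspects = []
    for device, (cf, cs) in cur.items():
        if device not in base:
            continue  # no baseline to compare against
        bf, bs = base[device]
        z = z_increase(bf, bs, cf, cs)
        if z > z_threshold:
            suspects.append((device, z))
    return sorted(suspects, key=lambda t: t[1], reverse=True)

# Toy data: two probe paths that share one core device.
path_a = ("tor1", "agg1", "core1")
path_b = ("tor2", "agg2", "core1")
baseline = [(path_a, True)] * 200 + [(path_b, True)] * 200
current = ([(path_a, False)] * 40 + [(path_a, True)] * 60
           + [(path_b, True)] * 100)

ranked = rank_suspects(baseline, current)
for device, z in ranked:
    print(f"{device}: z = {z:.2f}")
```

In this toy data, `tor1` and `agg1` tie because every probe that crosses one also crosses the other, which mirrors the abstract's point that active monitoring alone may not isolate the exact fault location; the paper addresses this by correlating the ranked list with passive monitoring signals, which this sketch does not attempt.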
Pages: 1689-1698
Page count: 10
Related papers
15 in total
[1]   A scalable, commodity data center network architecture [J].
Al-Fares, Mohammad ;
Loukissas, Alexander ;
Vahdat, Amin .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2008, 38 (04) :63-74
[2]  
[Anonymous], 2012, Proceedings of the 8th international conference on Emerging networking experiments and technologies
[3]  
[Anonymous], 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
[4]   Towards highly reliable enterprise network services via inference of multi-level dependencies [J].
Bahl, Paramvir ;
Chandra, Ranveer ;
Greenberg, Albert ;
Kandula, Srikanth ;
Maltz, David A. ;
Zhang, Ming .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2007, 37 (04) :13-24
[5]   Robust monitoring of link delays and faults in IP networks [J].
Bejerano, Yigal ;
Rastogi, Rajeev .
IEEE-ACM TRANSACTIONS ON NETWORKING, 2006, 14 (05) :1092-1103
[6]   Network tomography of binary network performance characteristics [J].
Duffield, Nick .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2006, 52 (12) :5373-5388
[7]   Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications [J].
Gill, Phillipa ;
Jain, Navendu ;
Nagappan, Nachiappan .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (04) :350-361
[8]   VL2: A Scalable and Flexible Data Center Network [J].
Greenberg, Albert ;
Hamilton, James R. ;
Jain, Navendu ;
Kandula, Srikanth ;
Kim, Changhoon ;
Lahiri, Parantap ;
Maltz, David A. ;
Patel, Parveen ;
Sengupta, Sudipta .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2009, 39 (04) :51-62
[9]  
Kandula S., 2005, Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, P173, DOI DOI 10.1145/1080173.1080178
[10]  
Kompella RR, 2005, USENIX ASSOCIATION PROCEEDINGS OF THE 2ND SYMPOSIUM ON NETWORKED SYSTEMS DESIGN & IMPLEMENTATION (NSDI '05), P57