Outage Prediction and Diagnosis for Cloud Service Systems

Cited by: 59
Authors
Chen, Yujun [1, 2]
Zhang, Hongyu [3]
Yang, Xian [2]
Lin, Qingwei [2]
Zhang, Dongmei [2]
Dong, Hang [2]
Xu, Yong [2]
Li, Hao [2]
Kang, Yu [2]
Gao, Feng [4]
Xu, Zhangwei [4]
Dang, Yingnong [4]
Affiliations
[1] Beihang Univ, Beijing 100191, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] Univ Newcastle, Callaghan, NSW, Australia
[4] Microsoft Azure, Redmond, WA USA
Source
WEB CONFERENCE 2019: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2019) | 2019
Funding
National Natural Science Foundation of China
Keywords
Outage prediction; outage diagnosis; cloud system; system of systems; service availability;
DOI
10.1145/3308558.3313501
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
With the rapid growth of cloud service systems and their increasing complexity, service failures have become unavoidable. Outages, which are critical service failures, can dramatically degrade system availability and impact user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast outages before they happen and diagnose their root causes after they occur. AirAlert works as a global watcher for the entire cloud system: it collects all alerting signals, detects dependencies among signals, and proactively predicts outages that may happen anywhere in the cloud system. We analyze the relationships between outages and alerting signals using a Bayesian network and predict outages using a robust classification method based on gradient boosting trees. The proposed outage management approach is evaluated using an outage dataset collected from a Microsoft cloud system, and the results confirm its effectiveness.
Pages: 2659-2665
Page count: 7