Towards Intelligent Incident Management: Why We Need It and How We Make It

Cited by: 56
Authors
Chen, Zhuangbin [1, 6]
Kang, Yu [2]
Li, Liqun [2]
Zhang, Xu [2]
Zhang, Hongyu [3]
Xu, Hui [4]
Zhou, Yangfan [4]
Yang, Li [5]
Sun, Jeffrey [5]
Xu, Zhangwei [5]
Dang, Yingnong [5]
Gao, Feng [5]
Zhao, Pu [2]
Qiao, Bo [2]
Lin, Qingwei [2]
Zhang, Dongmei [2]
Lyu, Michael R. [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] Univ Newcastle, Newcastle, NSW, Australia
[4] Fudan Univ, Shanghai, Peoples R China
[5] Microsoft Azure, E Lansing, MI USA
[6] Microsoft Res Asia, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 28TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE '20) | 2020
Funding
National Natural Science Foundation of China;
Keywords
Cloud Computing; Incident Management; AIOps;
DOI
10.1145/3368089.3417055
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. However, in practice, we still observe critical service incidents that occur in an unexpected manner and that the orchestrated diagnosis workflow fails to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it brings to the cloud services of Microsoft.
Pages: 1487-1497
Page count: 11