Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

Cited by: 24
Authors
Ahmed, Toufique [1]
Ghosh, Supriyo [2]
Bansal, Chetan [2]
Zimmermann, Thomas [2, 3]
Zhang, Xuchao [2]
Rajmohan, Saravan [2]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Microsoft, Bangalore, Karnataka, India
[3] Microsoft Res, Bellevue, WA USA
Source
2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) | 2023
Keywords
Incident Management; Service Quality; GPT-3.x; Large Language Models
DOI
10.1109/ICSE48619.2023.00149
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Incident management for cloud services is a complex, multi-step process that has a huge impact on both service health and developer productivity. On-call engineers need a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models in helping engineers root-cause and mitigate production incidents. We conduct a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.
Pages: 1737-1749
Number of pages: 13