TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

Cited by: 12
Authors
Ding, Ruomeng [1]
Zhang, Chaoyun [1]
Wang, Lu [1]
Xu, Yong [1]
Ma, Minghua [1]
Wu, Xiaomin [1]
Zhang, Meng [2]
Chen, Qingjun [2]
Gao, Xin [2]
Gao, Xuedong [2]
Fan, Hao [2]
Rajmohan, Saravan [2]
Lin, Qingwei [1]
Zhang, Dongmei [1]
Affiliations
[1] Microsoft, Beijing, Peoples R China
[2] Microsoft 365, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023 | 2023
Keywords
Trace data; Root Cause Analysis; Reinforcement Learning; DIAGNOSIS; GRAPH;
DOI
10.1145/3611643.3613864
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale: they usually comprise hundreds of components, which demands significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.
Pages: 1762-1773
Page count: 12