Trace-based Intelligent Fault Diagnosis for Microservices with Deep Learning

Cited by: 10
Authors
Chen, Hao [1]
Wei, Kegang [1, 3]
Li, An [1]
Wang, Tao [1, 2]
Zhang, Wenbo [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 100190, Peoples R China
Source
2021 IEEE 45TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2021) | 2021
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
microservice; fault diagnosis; deep learning; distributed tracing
DOI
10.1109/COMPSAC51774.2021.00121
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Owing to their scalability, fault tolerance, and high availability, distributed microservice-based applications are gradually replacing traditional monolithic applications as one of the main forms of Internet applications. However, current fault diagnosis methods for distributed applications suffer from coarse-grained fault localization and inaccurate root-cause analysis. To address these issues, this paper proposes a trace-based intelligent fault diagnosis approach for microservices using deep learning. First, we build a request weighted directed graph and a request string from collected historical traces to characterize the behavior of microservices. Then, we build a normal trace dataset from normal operation and a faulty trace dataset by injecting faults, and calculate the expected intervals of the microservices' response times and the call sequences. After that, we train a fault diagnosis model based on a deep neural network on the trace datasets to diagnose faulty microservices. Finally, we deploy TrainTicket, a typical open-source microservice-based application, and validate our approach by injecting various typical faults. The results show that our approach effectively characterizes the behavior of microservices when processing requests, achieving 91.5% accuracy in detecting faults and 85.2% accuracy in locating root causes.
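The abstract does not give the exact definitions of the request weighted directed graph, the request string, or the expected response-time interval. The Python sketch below is only an illustration of what such structures could look like, under stated assumptions: the span tuple layout, the service names, the use of call counts as edge weights, the `>`-joined string encoding, and the mean ± k standard deviations interval are all hypothetical choices, not the authors' method.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical span format (an assumption, not taken from the paper):
# (trace_id, caller, callee, start_ms, duration_ms)
spans = [
    ("t1", "gateway", "order", 0, 40),
    ("t1", "order", "payment", 5, 20),
    ("t1", "order", "inventory", 26, 10),
    ("t2", "gateway", "order", 0, 44),
    ("t2", "order", "payment", 6, 22),
    ("t2", "order", "inventory", 29, 12),
]


def build_request_graph(spans):
    """Weighted directed graph: edge (caller, callee) -> number of calls seen."""
    return Counter((caller, callee) for _, caller, callee, _, _ in spans)


def build_request_string(spans, trace_id):
    """Call sequence of one trace, ordered by span start time."""
    calls = sorted((s for s in spans if s[0] == trace_id), key=lambda s: s[3])
    return ">".join(f"{caller}-{callee}" for _, caller, callee, _, _ in calls)


def expected_intervals(spans, k=3.0):
    """Per-callee interval (mean - k*stdev, mean + k*stdev) of response time."""
    durations = {}
    for _, _, callee, _, duration in spans:
        durations.setdefault(callee, []).append(duration)
    intervals = {}
    for service, values in durations.items():
        mu = mean(values)
        # stdev needs at least two samples; fall back to a zero-width interval.
        sigma = stdev(values) if len(values) > 1 else 0.0
        intervals[service] = (mu - k * sigma, mu + k * sigma)
    return intervals


def is_anomalous(service, duration, intervals):
    """Flag a response time that falls outside the service's expected interval."""
    low, high = intervals[service]
    return not (low <= duration <= high)


graph = build_request_graph(spans)
request_string = build_request_string(spans, "t1")
intervals = expected_intervals(spans)
```

In this sketch, a response time outside a service's interval, or a request string that differs from those seen in normal traces, would mark a trace as suspicious; the paper itself trains a deep neural network on the normal and fault-injected trace datasets to perform the diagnosis.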
Pages: 884-893
Number of pages: 10