Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs

Cited by: 172
Authors
Zhou, Xiang [1 ,2 ,5 ]
Peng, Xin [1 ,2 ,5 ]
Xie, Tao [3 ]
Sun, Jun [4 ]
Ji, Chao [1 ,2 ,5 ]
Liu, Dewei [1 ,2 ,5 ]
Xiang, Qilin [1 ,2 ,5 ]
He, Chuan [1 ,2 ,5 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Univ Illinois, Urbana, IL 61801 USA
[4] Singapore Management Univ, Singapore, Singapore
[5] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
Source
ESEC/FSE'2019: PROCEEDINGS OF THE 2019 27TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING | 2019
Keywords
microservices; error prediction; fault localization; tracing; debugging; machine learning; FAILURE; CODE;
DOI
10.1145/3338906.3338961
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
In production environments, a large share of microservice failures are related to complex and dynamic interactions and runtime environments, such as those involving multiple instances, environmental configurations, and asynchronous interactions of microservices. Because of this complexity and dynamism, such failures are often hard to reproduce and diagnose in testing environments. It is desirable, yet still challenging, to detect these failures and locate the underlying faults at runtime in the production environment so that developers can resolve them efficiently. To address this challenge, we propose MEPFL, an approach to latent error prediction and fault localization for microservice applications that learns from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level, using system trace logs collected from automatic executions of the target application and of its faulty versions produced by fault injection. The prediction models can then be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL on top of container orchestrator and service mesh infrastructure, and conduct a series of experimental studies with two open-source microservice applications (one of them being, to the best of our knowledge, the largest open-source microservice application). The results indicate that MEPFL can achieve high accuracy in intra-application prediction of latent errors, faulty microservices, and fault types, and that it outperforms a state-of-the-art approach to failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.
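As a rough illustration of the trace-level idea described in the abstract (not the authors' implementation), the sketch below derives a few numeric features from the spans of a trace and labels a new trace by its nearest neighbour among traces collected from normal and fault-injected runs. The span layout, the chosen features, the label names, and the use of a 1-nearest-neighbour rule are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter
from math import dist

# Assumed span layout (hypothetical): (service name, duration in ms, HTTP status code).

def trace_features(spans):
    """Turn one trace (a list of spans) into a small numeric feature vector."""
    durations = [duration for _, duration, _ in spans]
    services = Counter(service for service, _, _ in spans)
    server_errors = sum(1 for _, _, status in spans if status >= 500)
    return (
        float(len(spans)),      # number of invocations in the trace
        float(len(services)),   # number of distinct microservices involved
        max(durations),         # slowest invocation in ms
        float(server_errors),   # invocations that returned a 5xx status
    )

def predict(train, spans):
    """Return the label of the training trace closest to the new trace.

    `train` is a list of (feature_vector, label) pairs, e.g. built from
    normal executions and from executions of fault-injected versions.
    This is a plain 1-nearest-neighbour rule, used here only as a stand-in
    for whatever prediction model is actually trained.
    """
    x = trace_features(spans)
    _, label = min(train, key=lambda item: dist(item[0], x))
    return label
```

A training set would pair `trace_features(...)` of each collected trace with a label such as "no_error" or a fault type; features on very different scales would normally be normalised before computing distances, which this sketch omits for brevity.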
Pages: 683-694
Number of pages: 12