No More Data Silos: Unified Microservice Failure Diagnosis With Temporal Knowledge Graph

Cited by: 1
Authors
Zhang, Shenglin [1]
Zhao, Yongxin [2]
Xia, Sibo [2]
Wei, Shirui [3]
Sun, Yongqian [4]
Zhao, Chenyu [5]
Ma, Shiyu [2]
Kuang, Junhua [2]
Zhu, Bolin [6]
Pan, Lemeng [7]
Guo, Yicheng [7]
Pei, Dan [8]
Affiliations
[1] Nankai Univ, Coll Software, Haihe Lab Informat Technol Applicat Innovat HL IT, Tianjin 300071, Peoples R China
[2] Nankai Univ, Tianjin 300192, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
[4] Nankai Univ, Coll Software, Tianjin Key Lab Software Experience & Human Comp I, Tianjin 300192, Peoples R China
[5] Alibaba Grp, Beijing 100020, Peoples R China
[6] Nanjing Univ, Nanjing 210093, Peoples R China
[7] Huawei Technol Co Ltd, Shenzhen 518129, Peoples R China
[8] Tsinghua Univ, Dept Comp Sci, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Microservice architectures; Measurement; Electronic mail; Prevention and mitigation; Fuses; Monitoring; Anomaly detection; Accuracy; Time factors; Microservice; failure diagnosis; multimodal data; knowledge graph;
DOI
10.1109/TSC.2024.3489444
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Microservices improve the scalability and flexibility of monolithic architectures to accommodate the evolution of software systems, but the complexity and dynamics of microservices challenge system reliability. Ensuring microservice quality requires efficient failure diagnosis, including detection and triage. Failure detection involves identifying anomalous behavior within the system, while triage entails classifying the failure type and directing it to the engineering team for resolution. Unfortunately, current approaches reliant on single-modal monitoring data, such as metrics, logs, or traces, cannot capture all failures and neglect interconnections among multimodal data, leading to erroneous diagnoses. Recent multimodal data fusion studies struggle to achieve deep integration, limiting diagnostic accuracy due to insufficiently captured interdependencies. Therefore, we propose UniDiag, which leverages temporal knowledge graphs to fuse multimodal data for effective failure diagnosis. UniDiag applies a simple yet effective stream-based anomaly detection method to reduce computational cost and a novel microservice-oriented graph embedding method to represent the state of systems comprehensively. To assess the performance of UniDiag, we conduct extensive evaluation experiments using datasets from two benchmark microservice systems, demonstrating its superiority over existing methods and affirming the efficacy of multimodal data fusion. Additionally, we have made the code and data publicly available to facilitate further research.
Pages: 4013-4026
Page count: 14