Real-Time Anomaly Detection Using Distributed Tracing in Microservice Cloud Applications

Cited by: 2
Authors
Raeiszadeh, Mahsa [1]
Ebrahimzadeh, Amin [1]
Saleem, Ahsan [1]
Glitho, Roch H. [1, 3]
Eker, Johan [2]
Mini, Raquel A. F. [2]
Affiliations
[1] Concordia Univ, CIISE, Montreal, PQ, Canada
[2] Ericsson Res, Lund, Sweden
[3] Univ Western Cape, Comp Sci Programme, Cape Town, South Africa
Source
2023 IEEE 12TH INTERNATIONAL CONFERENCE ON CLOUD NETWORKING, CLOUDNET | 2023
Keywords
Anomaly Detection; Distributed Tracing; Microservice; Positive and Unlabeled Learning
DOI
10.1109/CloudNet59005.2023.10490038
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Distributed tracing plays a vital role in microservice infrastructure, and learning-based trace analysis has been used to detect anomalies in such systems. However, existing learning-based approaches to trace-based anomaly detection have limitations. Some assume that trace patterns can be learned solely from normal executions, while others depend on anomaly injection to generate traces labeled as normal or anomalous. In practice, anomalies may also occur during normal execution, and the wide variety of anomalies that arise cannot be captured through anomaly injection alone. To address these issues, we propose a Trace-Driven Anomaly Detection (TDAD) approach based on a Span Causal Graph (SCG) representation, which trains a model using a Graph Neural Network (GNN) and Positive and Unlabeled (PU) learning. This technique allows the model parameters to be optimized by estimating the underlying data distribution. As a result, TDAD can be trained effectively using a small number of labeled anomalous traces along with a relatively large number of unlabeled traces. Our evaluation shows that TDAD outperforms existing unsupervised trace-based anomaly detection methods by 11.9% in F1-score and outperforms a supervised learning-based benchmark by 12x in detection time.
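The abstract does not say which PU estimator TDAD uses, so the following is only an illustrative sketch of how a model can be trained from a few labeled anomalous traces plus many unlabeled ones. It computes the widely used non-negative PU risk with a sigmoid surrogate loss. The function names, the scalar per-trace scores, and the `prior` parameter (the assumed fraction of anomalies among unlabeled traces) are assumptions for illustration and are not taken from the paper.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss 1 / (1 + exp(label * score)), label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimate (hypothetical helper, not the paper's code).

    pos_scores: model scores for labeled anomalous (positive) traces
    unl_scores: model scores for unlabeled traces
    prior:      assumed fraction of anomalous traces among the unlabeled ones
    """
    # Loss on labeled anomalies when treated as positives.
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    # Loss on labeled anomalies when treated as negatives.
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Loss on unlabeled traces when treated as negatives.
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])

    # Unlabeled data is a mixture of normal and anomalous traces, so the
    # anomalous share is subtracted to estimate the risk on normal traces.
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg

    # Clamping at zero keeps the estimate from going negative, which would
    # otherwise encourage overfitting.
    return prior * risk_pos_as_pos + max(0.0, neg_risk)
```

In a training loop, the scores would come from the trace model (a GNN over span graphs in the paper's setting) and this risk would be minimized with respect to the model parameters. The value of `prior` must be estimated or assumed separately.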
Pages: 36-44
Page count: 9