Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications

Cited by: 0
Authors
Raeiszadeh, Mahsa [1]
Ebrahimzadeh, Amin [1]
Glitho, Roch H. [1]
Eker, Johan [2]
Mini, Raquel A. F. [2]
Affiliations
[1] Concordia Univ, CIISE, Montreal, PQ H3G 1M8, Canada
[2] Ericsson Res, S-22362 Lund, Sweden
Source
IEEE TRANSACTIONS ON MACHINE LEARNING IN COMMUNICATIONS AND NETWORKING | 2025 / Vol. 3
Keywords
Microservice architectures; Anomaly detection; Real-time systems; Computational modeling; Machine learning; Data models; Computer architecture; Image edge detection; Federated learning; Servers; distributed data; federated learning; microservice; trace;
DOI
10.1109/TMLCN.2025.3527919
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The complexity and dynamicity of microservice architectures in cloud environments present substantial challenges to the reliability and availability of the services built on these architectures. Therefore, effective anomaly detection is crucial to prevent impending failures and resolve them promptly. Distributed data analysis techniques based on machine learning (ML) have recently gained attention in detecting anomalies in microservice systems. ML-based anomaly detection techniques mostly require centralized data collection and processing, which may raise scalability and computational issues in practice. In this paper, we propose an Asynchronous Real-Time Federated Learning (ART-FL) approach for anomaly detection in cloud-based microservice systems. In our approach, edge clients perform real-time learning with continuous streaming local data. At the edge clients, we model intra-service behaviors and inter-service dependencies in multi-source distributed data based on a Span Causal Graph (SCG) representation and train a model through a combination of Graph Neural Network (GNN) and Positive and Unlabeled (PU) learning. Our FL approach updates the global model in an asynchronous manner to achieve accurate and efficient anomaly detection, addressing computational overhead across diverse edge clients, including those that experience delays. Our trace-driven evaluations indicate that the proposed method outperforms the state-of-the-art anomaly detection methods by 4% in terms of F1-score while meeting the given time efficiency and scalability requirements.
Pages: 176-194
Number of pages: 19
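
The abstract says the global model is updated asynchronously so that delayed edge clients can still contribute. The record does not give the actual ART-FL update rule, so the sketch below is only a generic illustration of one common way to do this: a staleness-weighted merge, in which an update computed against an older global model is mixed in with a smaller weight. The class `AsyncServer`, the parameter `base_mix`, and the decay `1 / (1 + staleness)` are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AsyncServer:
    """Toy server that merges each client update as soon as it arrives.

    Hypothetical illustration only; not the ART-FL algorithm.
    """
    weights: List[float]
    base_mix: float = 0.5  # assumed mixing rate for a perfectly fresh update
    version: int = 0       # number of client updates merged so far

    def staleness_factor(self, client_version: int) -> float:
        # Assumed decay: the more global updates the client missed,
        # the smaller the weight given to its contribution.
        staleness = self.version - client_version
        return 1.0 / (1.0 + staleness)

    def apply_update(self, client_weights: List[float], client_version: int) -> float:
        # Mix the client's model into the global model without waiting
        # for any other client (no synchronization barrier).
        alpha = self.base_mix * self.staleness_factor(client_version)
        self.weights = [
            (1.0 - alpha) * g + alpha * c
            for g, c in zip(self.weights, client_weights)
        ]
        self.version += 1
        return alpha


server = AsyncServer(weights=[0.0, 0.0])
# A fast client trained on global version 0 and reports first.
alpha_fast = server.apply_update([1.0, 1.0], client_version=0)
# A delayed client also started from version 0 but reports one update late.
alpha_slow = server.apply_update([1.0, 1.0], client_version=0)
```

In this sketch the delayed client is merged with half the weight of the fast one rather than being dropped or blocking the round, which is the general trade-off asynchronous schemes make between freshness and participation.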