Harnessing federated learning for anomaly detection in supercomputer nodes

Authors
Farooq, Emmen [1]
Milano, Michela [1]
Borghesi, Andrea [1]
Affiliations
[1] Univ Bologna, DISI, Bologna, Italy
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 161
Keywords
Federated learning; Anomaly detection; High-performance computing; Data center; Machine learning
DOI
10.1016/j.future.2024.07.052
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
High-performance computing (HPC) systems are a crucial component of modern society, with a significant impact in areas ranging from economics to scientific research, thanks to their unrivaled computational capabilities. For this reason, the number of HPC installations worldwide is rising steeply, with no sign of slowing down. However, these machines are complex (comprising millions of heterogeneous components), hard to manage effectively, and very costly, both in economic investment and in energy consumption. Maximizing their productivity is therefore of paramount importance. For instance, anomalies and faults can cause significant downtime because they are difficult to detect promptly: many potential sources of issues can prevent computing nodes from functioning correctly. In recent years, several data-driven methods have been proposed to detect anomalies in HPC systems automatically, exploiting the fact that modern supercomputers are typically equipped with fine-grained monitoring infrastructures that collect data characterizing system behavior. It is thus possible to train Machine Learning (ML) models to distinguish normal from anomalous states automatically. In this paper, we contribute to this line of research with a novel idea, namely exploiting Federated Learning (FL) to improve the accuracy of anomaly detection models for HPC nodes. Although FL is not typically used in the HPC context, we show that it can boost several types of underlying ML models, from supervised to unsupervised ones. We demonstrate our approach on a production Tier-0 supercomputer hosted in Italy. Applying FL to anomaly detection improves the average F-score from 0.46 to 0.87. Our research also shows that FL can reduce the data collection time required to build a representative data set, enabling faster deployment of anomaly detection models. ML models need 5 months of training data to detect anomalies effectively, whereas using FL reduces the required training data by a factor of 15, to 1.25 weeks.
Pages: 673-685
Number of pages: 13