Harnessing federated learning for anomaly detection in supercomputer nodes

Authors
Farooq, Emmen [1]
Milano, Michela [1]
Borghesi, Andrea [1]
Affiliations
[1] Univ Bologna, DISI, Bologna, Italy
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 161
Keywords
Federated learning; Anomaly detection; High-performance computing; Data center; Machine learning
DOI
10.1016/j.future.2024.07.052
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
High-performance computing (HPC) systems are a crucial component of modern society, with a significant impact in areas ranging from economics to scientific research, thanks to their unrivaled computational capabilities. For this reason, the number of HPC installations worldwide is rising steeply, with no sign of slowing down. However, these machines are complex, comprising millions of heterogeneous components; hard to manage effectively; and very costly, in terms of both economic investment and energy consumption. Therefore, maximizing their productivity is of paramount importance. For instance, anomalies and faults can generate significant downtime because they are difficult to detect promptly: many potential sources of issues can prevent computing nodes from functioning correctly. In recent years, several data-driven methods have been proposed to automatically detect anomalies in HPC systems, exploiting the fact that modern supercomputers are typically endowed with fine-grained monitoring infrastructures, collecting data that can be used to characterize the system behavior. Thus, it is possible to train Machine Learning (ML) models to distinguish normal from anomalous states automatically. In this paper, we contribute to this line of research with a novel intuition, namely exploiting Federated Learning (FL) to improve the accuracy of anomaly detection models for HPC nodes. Although FL is not typically exploited in the HPC context, we show that FL can boost several types of underlying ML models, from supervised to unsupervised ones. We demonstrate our approach on a production Tier-0 supercomputer hosted in Italy. Applying FL to anomaly detection improves the average F-score from 0.46 to 0.87. Our research also shows that FL can reduce the data collection time required to build a representative data set, facilitating faster deployment of anomaly detection models. Without FL, ML models need 5 months of training data to detect anomalies effectively; with FL, the required training data shrinks by a factor of 15, to 1.25 weeks.
Pages: 673-685
Number of pages: 13