Harnessing federated learning for anomaly detection in supercomputer nodes

Cited by: 0
Authors
Farooq, Emmen [1]
Milano, Michela [1]
Borghesi, Andrea [1]
Affiliation
[1] Univ Bologna, DISI, Bologna, Italy
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 161
Keywords
Federated learning; Anomaly detection; High-performance computing; Data center; Machine learning
DOI
10.1016/j.future.2024.07.052
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
High-performance computing (HPC) systems are a crucial component of modern society, with a significant impact in areas ranging from economics to scientific research, thanks to their unrivaled computational capabilities. For this reason, the number of HPC installations worldwide is trending steeply upward, with no sign of slowing down. However, these machines are complex (comprising millions of heterogeneous components), hard to manage effectively, and very costly, both in economic investment and in energy consumption. Maximizing their productivity is therefore of paramount importance. For instance, anomalies and faults can cause significant downtime because they are difficult to detect promptly: many potential sources of issues can prevent computing nodes from functioning correctly. In recent years, several data-driven methods have been proposed to detect anomalies in HPC systems automatically, exploiting the fact that modern supercomputers are typically equipped with fine-grained monitoring infrastructures that collect data characterizing the system behavior. This makes it possible to train Machine Learning (ML) models to distinguish between normal and anomalous states automatically. In this paper, we contribute to this line of research with a novel idea: exploiting Federated Learning (FL) to improve the accuracy of anomaly detection models for HPC nodes. Although FL is not typically used in the HPC context, we show that it can boost several types of underlying ML models, from supervised to unsupervised ones. We demonstrate our approach on a production Tier-0 supercomputer hosted in Italy. Applying FL to anomaly detection improves the average F-score from 0.46 to 0.87. Our research also shows that FL can reduce the data collection time required to build a representative data set, enabling faster deployment of anomaly detection models. Without FL, ML models need 5 months of training data to reach good anomaly detection performance, whereas FL reduces the required training set by a factor of 15, to 1.25 weeks.
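The abstract does not spell out the federated scheme, so the following is only a minimal sketch, assuming standard FedAvg-style aggregation (averaging client weights in proportion to their sample counts) with each compute node acting as one client that keeps its monitoring data local. A logistic-regression detector stands in for the supervised models the paper mentions; in principle the same aggregation applies to the weights of an unsupervised model such as an autoencoder. The synthetic data generator, all function names (`local_train`, `fed_avg`, `federated_training`, `f1_score`, `make_node_data`) and all hyperparameters are hypothetical, and the F-score is assumed to be the F1 variant with "anomalous" as the positive class.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """Probability that sample x (whose last entry is a bias feature) is anomalous."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def local_train(w, data, lr=0.1, epochs=5):
    """A few epochs of SGD on the logistic loss over one node's private data."""
    w = list(w)  # copy, so the global model is not modified in place
    for _ in range(epochs):
        for x, y in data:
            err = predict_proba(w, x) - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def fed_avg(client_ws, client_sizes):
    """FedAvg-style aggregation: client weights averaged in proportion to sample count."""
    total = sum(client_sizes)
    return [
        sum(w[j] * n for w, n in zip(client_ws, client_sizes)) / total
        for j in range(len(client_ws[0]))
    ]

def federated_training(clients, dim, rounds=10):
    """Each round: every node trains locally, then only weights are aggregated."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        local_ws = [local_train(global_w, data) for data in clients]
        global_w = fed_avg(local_ws, [len(data) for data in clients])
    return global_w

def f1_score(w, data, threshold=0.5):
    """F1 score with 'anomalous' (label 1) as the positive class."""
    tp = fp = fn = 0
    for x, y in data:
        pred = 1 if predict_proba(w, x) >= threshold else 0
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def make_node_data(rng, n, anomaly_rate=0.2):
    """Hypothetical synthetic telemetry: two features plus a constant bias feature."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < anomaly_rate else 0
        mu = 3.0 if y else 0.0  # anomalous samples are shifted away from normal ones
        data.append(([rng.gauss(mu, 1.0), rng.gauss(mu, 1.0), 1.0], y))
    return data

rng = random.Random(0)
clients = [make_node_data(rng, 30) for _ in range(4)]  # four nodes, small local data sets
test_set = make_node_data(rng, 500)
global_w = federated_training(clients, dim=3)
print(f"federated F1 on held-out data: {f1_score(global_w, test_set):.2f}")
```

Only weight vectors cross node boundaries, which is what lets several nodes with short individual data histories pool what they have learned; whether this mirrors the paper's exact protocol is not stated in the abstract.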
Pages: 673-685
Number of pages: 13
Related papers (50 in total)
  • [1] A Federated Learning Approach for Anomaly Detection in High Performance Computing
    Farooq, Emmen
    Borghesi, Andrea
    2023 IEEE 35TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2023, : 496 - 500
  • [2] Network Anomaly Detection Using Federated Learning
    Marfo, William
    Tosh, Deepak K.
    Moore, Shirley V.
    2022 IEEE MILITARY COMMUNICATIONS CONFERENCE (MILCOM), 2022
  • [3] Enhancing IoT anomaly detection performance for federated learning
    Weinger, Brett
    Kim, Jinoh
    Sim, Alex
    Nakashima, Makiya
    Moustafa, Nour
    Wu, K. John
    DIGITAL COMMUNICATIONS AND NETWORKS, 2022, 8 (03) : 314 - 323
  • [4] Enhancing IoT Anomaly Detection Performance for Federated Learning
    Weinger, Brett
    Kim, Jinoh
    Sim, Alex
    Nakashima, Makiya
    Moustafa, Nour
    Wu, K. John
    2020 16TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING (MSN 2020), 2020, : 206 - 213
  • [5] FedGroup: A Federated Learning Approach for Anomaly Detection in IoT Environments
    Zhang, Yixuan
    Suleiman, Basem
    Alibasa, Muhammad Johan
    MOBILE AND UBIQUITOUS SYSTEMS: COMPUTING, NETWORKING AND SERVICES, MOBIQUITOUS 2022, 2023, 492 : 121 - 132
  • [6] Collaborative Anomaly Detection for Internet of Things based on Federated Learning
    Kim, Seongwoo
    Cai, He
    Hua, Cunqing
    Gu, Pengwenlong
    Xu, Wenchao
    Park, Jeonghyeok
    2020 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC), 2020, : 623 - 628
  • [7] Trust-based federated learning for network anomaly detection
    Chen, Naiyue
    Jin, Yi
    Li, Yinglong
    Cai, Luxin
    WEB INTELLIGENCE, 2021, 19 (04) : 317 - 327
  • [8] Anomaly Detection through Unsupervised Federated Learning
    Nardi, Mirko
    Valerio, Lorenzo
    Passarella, Andrea
    2022 18TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN, 2022, : 495 - 501
  • [9] Federated Learning for Anomaly Detection in Vehicular Networks
    Tham, Chen-Khong
    Yang, Lu
    Khanna, Akshit
    Gera, Bhavya
    2023 IEEE 97TH VEHICULAR TECHNOLOGY CONFERENCE, VTC2023-SPRING, 2023
  • [10] Communication-Efficient Federated Learning for Network Traffic Anomaly Detection
    Cui, Xiao
    Han, Xiaohui
    Liu, Guangqi
    Zuo, Wenbo
    Wang, Zhiwen
    2023 19TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN 2023, 2023, : 398 - 405