A Federated Learning Approach for Anomaly Detection in High Performance Computing

Cited by: 4
Authors
Farooq, Emmen [1]
Borghesi, Andrea [1]
Affiliations
[1] Univ Bologna, DISI, Bologna, Italy
Source
2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI) | 2023
Keywords
Federated Learning; High Performance Computing; Anomaly Detection; Machine Learning
DOI
10.1109/ICTAI59109.2023.00079
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
High Performance Computing (HPC) systems are complex machines that must be operated at their maximum potential to recoup their investment cost and to mitigate their environmental impact. Anomalous conditions that hinder the correct usage of supercomputing nodes are a significant problem, so the development of automated anomaly detection techniques remains a vital area of research. Machine Learning (ML) models have proven effective at detecting anomalies on individual nodes. However, the potential of combining data from multiple computing nodes and their associated ML models has not yet been explored. Federated Learning (FL) can address this shortcoming by allowing individual models to learn from each other. This paper applies FL to improve the performance of anomaly detection models for HPC systems. The approach has been validated on data from an actual supercomputer, improving the average F-score from 0.31 to 0.84. We also show how FL can significantly shorten the data collection period needed to create a training set: while ML models need, on average, 4.5 months of training data, FL reduces this to 1.2 weeks, a 15x reduction.
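The abstract does not say how the per-node models are combined. As a rough illustration only, the sketch below assumes a FedAvg-style weighted average of model parameters, which is one common way for individual models to "learn from each other". The function name `federated_average`, the flat parameter lists, and the sample counts are all hypothetical and are not taken from the paper.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter lists, weighted by local sample count.

    Assumption: FedAvg-style aggregation; the paper's actual scheme may differ.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical example: three compute nodes, each holding a two-parameter model.
node_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
node_samples = [100, 100, 200]  # made-up local training-set sizes

global_weights = federated_average(node_weights, node_samples)
print(global_weights)
```

In this toy setup the third node contributes twice as many samples, so its parameters pull the shared model further than those of the other two nodes.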
Pages: 496-500
Number of pages: 5