Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models

Cited by: 1
Authors
Molan, Martin [1]
Borghesi, Andrea [1]
Benini, Luca [1,2]
Bartolini, Andrea [1]
Affiliations
[1] Univ Bologna, DISI & DEI Dept, Bologna, Italy
[2] ETH, Inst Integrierte Syst, Zurich, Switzerland
Source
EURO-PAR 2022: PARALLEL PROCESSING | 2022 / Vol. 13440
Keywords
supercomputer monitoring; deep learning; unsupervised learning; autoencoders; predictive maintenance; anomaly detection; diagnosis
DOI
10.1007/978-3-031-12597-3_11
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computing (HPC) systems, where many components can fail or behave wrongly. Building a data-driven representation of the computing nodes can help with predictive maintenance and facility management. Luckily, most of the current supercomputers are endowed with monitoring frameworks that can build such representations in conjunction with Deep Learning (DL) models. In this work, we propose a novel semi-supervised DL approach based on autoencoder networks and clustering algorithms (applied to the latent representation) to build a digital twin of the computing nodes of the system. The DL model projects the node features into a lower-dimensional space. Then, clustering is applied to capture and reveal underlying, non-trivial correlations between the features. The extracted information provides valuable insights for system administrators and managers, such as anomaly detection and node classification based on their behaviour and operative conditions. We validated the approach on 240 nodes from the Marconi 100 system, a Tier-0 supercomputer located in CINECA (Italy), considering a 10-month period.
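The pipeline described in the abstract (an autoencoder that compresses node features into a latent space, followed by clustering of the latent codes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are synthetic (two made-up groups of "nodes" with six features each), the autoencoder is a tiny linear one trained with plain gradient descent, and k-means with a farthest-point initialisation stands in for whichever clustering algorithm the authors used. The names `train_autoencoder` and `kmeans` are hypothetical helpers.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, epochs=1000, lr=0.01, seed=0):
    """Train a small linear autoencoder (illustrative, not the paper's model)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))  # encoder weights
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))  # decoder weights
    losses = []
    for _ in range(epochs):
        Z = X @ W_enc              # project features into the latent space
        err = Z @ W_dec - X        # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared reconstruction error (up to a constant factor).
        grad_dec = Z.T @ err / n
        grad_enc = X.T @ (err @ W_dec.T) / n
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses

def kmeans(Z, k=2, iters=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centroids = [Z[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(Z[np.argmax(d2)])
    centroids = np.array(centroids)
    for _ in range(iters):
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Synthetic stand-in for node telemetry: 30 nodes per group, 6 features each.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 6))
group_b = rng.normal(loc=6.0, scale=1.0, size=(30, 6))
X = np.vstack([group_a, group_b])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each feature

W_enc, W_dec, losses = train_autoencoder(X)
Z = X @ W_enc                               # latent representation of each node
labels, centroids = kmeans(Z, k=2)

# Per-node reconstruction error, a common anomaly score for autoencoders.
recon_error = np.mean((Z @ W_dec - X) ** 2, axis=1)
print("cluster sizes:", np.bincount(labels, minlength=2))
print("mean reconstruction error:", float(recon_error.mean()))
```

In a setting like the one the abstract describes, the rows of `X` would be aggregated monitoring features per node rather than synthetic samples, and the cluster assignments and reconstruction errors would be inspected by administrators rather than checked against known groups.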
Pages: 171-185
Page count: 15