Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

Cited by: 14
Authors
Aksar, Burak [1]
Zhang, Yijia [1]
Ates, Emre [1]
Schwaller, Benjamin [2]
Aaziz, Omar [2]
Leung, Vitus J. [2]
Brandt, Jim [2]
Egele, Manuel [1]
Coskun, Ayse K. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Sandia Natl Labs, Albuquerque, NM 87123 USA
Source
HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2021 | 2021 / Vol. 12728
Keywords
Anomaly diagnosis; Semi-supervised learning; High performance computing; INFRASTRUCTURE;
DOI
10.1007/978-3-030-78713-4_11
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies' characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.
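The two-stage idea in the abstract (learn from unlabeled telemetry first, then classify anomaly types from a few labeled samples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the unsupervised model or the classifier, so the sketch swaps in PCA (via SVD) as the unsupervised stage and a nearest-centroid rule as the supervised stage, and it runs on synthetic data with hypothetical label names. It is not the authors' implementation.

```python
import numpy as np

# Hypothetical label names, used only for this illustration.
ANOMALY_TYPES = ("healthy", "memleak", "cpuoccupy")


def make_synthetic_telemetry(rng, n_per_class, n_metrics=20):
    """Generate toy 'telemetry': each anomaly type shifts its own block of 5 metrics."""
    samples, labels = [], []
    for idx, name in enumerate(ANOMALY_TYPES):
        mean = np.zeros(n_metrics)
        if idx > 0:
            mean[(idx - 1) * 5 : idx * 5] = 5.0
        samples.append(rng.normal(loc=mean, scale=1.0, size=(n_per_class, n_metrics)))
        labels.extend([name] * n_per_class)
    return np.vstack(samples), np.array(labels)


def fit_unsupervised_encoder(unlabeled, n_components):
    """Stage 1 (stand-in): learn a low-dimensional projection from unlabeled data with PCA."""
    mean = unlabeled.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(unlabeled - mean, full_matrices=False)
    components = vt[:n_components]
    return lambda x: (x - mean) @ components.T


def fit_centroid_classifier(features, labels):
    """Stage 2 (stand-in): nearest-centroid classifier fit on a few labeled samples."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])

    def predict(x):
        # Distance from every sample to every class centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        return classes[dists.argmin(axis=1)]

    return predict


def run_demo(seed=0):
    """Train on many unlabeled and 5 labeled samples per class; return test accuracy."""
    rng = np.random.default_rng(seed)
    unlabeled, _ = make_synthetic_telemetry(rng, n_per_class=200)  # labels discarded
    few_x, few_y = make_synthetic_telemetry(rng, n_per_class=5)
    test_x, test_y = make_synthetic_telemetry(rng, n_per_class=100)

    encode = fit_unsupervised_encoder(unlabeled, n_components=2)
    predict = fit_centroid_classifier(encode(few_x), few_y)
    return float((predict(encode(test_x)) == test_y).mean())


if __name__ == "__main__":
    print(f"accuracy: {run_demo():.2f}")
```

The point of the split is that the representation is learned without labels, so only the small second stage depends on the scarce labeled anomalies. The toy classes are well separated by construction, so the accuracy printed here says nothing about results on real HPC telemetry.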
Pages: 195-214
Page count: 20