Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

Cited by: 14
Authors
Aksar, Burak [1]
Zhang, Yijia [1]
Ates, Emre [1]
Schwaller, Benjamin [2]
Aaziz, Omar [2]
Leung, Vitus J. [2]
Brandt, Jim [2]
Egele, Manuel [1]
Coskun, Ayse K. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Sandia Natl Labs, Albuquerque, NM 87123 USA
Source
HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2021 | 2021 / Vol. 12728
Keywords
Anomaly diagnosis; Semi-supervised learning; High performance computing; INFRASTRUCTURE
DOI
10.1007/978-3-030-78713-4_11
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies' characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.
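The two-stage idea in the abstract (learn from unlabeled telemetry first, then classify anomaly types from a few labeled samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the models used, so the sketch swaps in per-feature z-score scaling for the unsupervised stage and a nearest-centroid classifier for the supervised stage. The feature names (CPU and memory utilization), the anomaly labels (`cpu_hog`, `mem_leak`), and all data are hypothetical and synthetic.

```python
import math
import random
from statistics import mean, pstdev


def fit_scaler(unlabeled):
    """Stage 1 (unsupervised): learn per-feature mean and std from unlabeled telemetry."""
    columns = list(zip(*unlabeled))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero variance
    return means, stds


def transform(sample, scaler):
    """Map a raw sample into the scaled feature space learned in stage 1."""
    means, stds = scaler
    return [(x - m) / s for x, m, s in zip(sample, means, stds)]


def fit_centroids(labeled, scaler):
    """Stage 2 (supervised): one centroid per label, built from a few labeled samples."""
    groups = {}
    for sample, label in labeled:
        groups.setdefault(label, []).append(transform(sample, scaler))
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def diagnose(sample, scaler, centroids):
    """Return the label whose centroid is closest to the scaled sample."""
    z = transform(sample, scaler)
    return min(centroids, key=lambda label: math.dist(z, centroids[label]))


# Hypothetical synthetic telemetry: [cpu_utilization, memory_utilization].
rng = random.Random(0)


def make_sample(kind):
    cpu = rng.gauss(0.9 if kind == "cpu_hog" else 0.4, 0.05)
    mem = rng.gauss(0.9 if kind == "mem_leak" else 0.3, 0.05)
    return [cpu, mem]


kinds = ["healthy", "cpu_hog", "mem_leak"]
unlabeled = [make_sample(rng.choice(kinds)) for _ in range(300)]  # plentiful, no labels
labeled = [(make_sample(k), k) for k in kinds for _ in range(3)]  # only 3 labels per class

scaler = fit_scaler(unlabeled)
centroids = fit_centroids(labeled, scaler)

test_set = [(make_sample(k), k) for k in kinds for _ in range(50)]
accuracy = sum(diagnose(x, scaler, centroids) == k for x, k in test_set) / len(test_set)
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

The point of the split is that the first stage consumes only unlabeled samples, which are cheap to collect on a production system, while the second stage needs just a handful of labeled anomalies per type. A real deployment would presumably replace both stand-ins with stronger models and evaluate with F1-score and anomaly miss rate, as the abstract reports.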
Pages: 195-214
Page count: 20