Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2

Authors:

Shilpika, Shilpika ^{[1,2]}

Lusch, Bethany ^{[2]}

Emani, Murali ^{[2]}

Simini, Filippo ^{[2]}

Vishwanath, Venkatram ^{[2]}

Papka, Michael E. ^{[2,3]}

Ma, Kwan-Liu ^{[1]}

Affiliations:

[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA

[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA

[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA

Source:

2022 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2022) | 2022

Keywords:

Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability

DOI:

10.1109/CCGrid54584.2022.00081

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

To maintain a robust and reliable supercomputing facility, monitoring it and understanding its hardware system events and behaviors is an essential task. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components and measured at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution for analyzing diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system, such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.

Pages: 716-725

Page count: 10
