Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2]
Emani, Murali [2]
Simini, Filippo [2]
Vishwanath, Venkatram [2]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Source
2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022) | 2022
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability;
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Maintaining a robust and reliable supercomputing facility requires monitoring it and understanding its hardware system events and behaviors. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution for analyzing the diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs, despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting job exit status, and it does so with sufficient lead time for evasive action to be taken before the actual failure event occurs.
Pages: 716-725
Page count: 10