HPC System Data Pipeline to Enable Meaningful Insights through Analysis-Driven Visualizations

Cited by: 10

Authors:

Schwaller, Benjamin [1]

Tucker, Nick [2]

Tucker, Tom [2]

Allan, Benjamin [1]

Brandt, Jim [1]

Affiliations:

[1] Sandia Natl Labs, POB 5800, Albuquerque, NM 87185 USA

[2] Open Grid Comp, Austin, TX USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020) | 2020

Keywords:

HPC monitoring; Grafana; visualization; operational data analytics

DOI:

10.1109/CLUSTER49012.2020.00062

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The increasing complexity of High Performance Computing (HPC) systems has created a growing need for facilitating insight into system performance and utilization for administrators and users. The strides made in HPC system monitoring data collection have produced terabyte/day sized time-series data sets rich with critical information, but it is onerous to extract and construe meaningful information from these metrics. We have designed and developed an architecture that enables flexible, as-needed, run-time analysis and presentation capabilities for HPC monitoring data. Our architecture enables quick and efficient data filtration and analysis. Complex run-time or historical analyses can be expressed as Python-based computations. Results of analyses and a variety of HPC oriented summaries are displayed in a Grafana front-end interface. To demonstrate our architecture, we have deployed it in production for a 1500-node HPC system and have developed analyses and visualizations requested by system administrators, and later employed by users, to track key metrics about the cluster at a job, user, and system level. Our architecture is generic, applicable to any *-nix based system, and it is extensible to supporting multi-cluster HPC centers. We structure it with easily replaced modules that allow unique customization across clusters and centers. In this paper, we describe the data collection and storage infrastructure, the application created to query and analyze data from a custom database, and the visual displays created to provide clear insights into HPC system behavior.


Pages: 433-441

Page count: 9
