Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Cited by: 1

Authors:

DeConinck, A. [1]

Kelly, K. [1]

Affiliations:

[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA

Source:

2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015 | 2015

Keywords:

DOI:

10.1109/CLUSTER.2015.123

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

High Performance Computing (HPC) systems typically have lifetimes of four to six years. During this lifetime a system will undergo substantial changes in the system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with standard monitoring and analysis tools available on Mustang since its installation, and how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.

Citation

Pages: 710-713

Page count: 4

References (7 in total)

[1] Adaptive Computing, 2014, MOAB HPC SUIT VERS 7

[2] Lawrence Livermore National Laboratory, 2014, SIMPL LIN UT RES MAN

[3] Mellanox Technologies, 2011, CONNECTX 2 VPI SINGL

[4] Michalak S. E., 2015, LAUR1522234 LOS AL N

[5] Morreale P. W., 2008, DOCUMENTATION PROC S

[6] Splunk Inc, 2014, SPLUNK VERS 5 0 12

[7] Baler: deterministic, lossless log message clustering tool [J]. Taerat, Narate; Brandt, Jim; Gentile, Ann; Wong, Matthew; Leangsuksun, Chokchai. COMPUTER SCIENCE-RESEARCH AND DEVELOPMENT, 2011, 26 (3-4): 285-295
