Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Cited by: 1
Authors
DeConinck, A. [1]
Kelly, K. [1]
Affiliation
[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA
Source
2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015 | 2015
DOI
10.1109/CLUSTER.2015.123
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High Performance Computing (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in its system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Pages: 710-713
Page count: 4