Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Cited by: 1

Authors:

DeConinck, A. [1]

Kelly, K. [1]

Affiliations:

[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA

Source:

2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015 | 2015

DOI:

10.1109/CLUSTER.2015.123

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

High Performance Computer (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in the system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with standard monitoring and analysis tools available on Mustang since its installation, and how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.

Pages: 710-713

Page count: 4
