Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Cited by: 1
Authors
DeConinck, A. [1]
Kelly, K. [1]
Affiliation
[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA
DOI
10.1109/CLUSTER.2015.123
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High-performance computing (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system undergoes substantial changes to its software stack and hardware configuration. At the same time, the physical environment around it changes as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, which has resulted in reliability and stability issues on the cluster. We present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and their usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Pages: 710-713
Page count: 4