Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Authors
DeConinck, A. [1]
Kelly, K. [1]
Affiliation
[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA
DOI
10.1109/CLUSTER.2015.123
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High Performance Computing (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in its system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Pages: 710-713
Page count: 4