Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Authors
DeConinck, A. [1]
Kelly, K. [1]
Affiliations
[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA
DOI
10.1109/CLUSTER.2015.123
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High Performance Computer (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in its system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Pages: 710-713
Number of pages: 4