Collecting, Monitoring, and Analyzing Facility and Systems Data at the National Energy Research Scientific Computing Center

Cited by: 15
Authors
Bautista, Elizabeth [1]
Romanus, Melissa [1, 2]
Davis, Thomas [1]
Whitney, Cary [1]
Kubaska, Theodore [3]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] Energy Efficiency HPC Working Grp EE HPC WG, Berkeley, CA USA
Source
PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS (ICPP 2019) | 2019
Keywords
data centers; operations; monitoring; high-performance computing; data collection; operational data analytics; time series data; Green HPC
DOI
10.1145/3339186.3339213
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
As high-performance computing (HPC) resources continue to grow in size and complexity, so too do the volume and velocity of the operational data associated with them. At such scales, new mechanisms and technologies are required to continuously gather, store, and analyze this data in near-real time from heterogeneous and distributed sources without impacting the underlying data center operations or HPC resource utilization. In this paper, we describe our experiences in designing and implementing an infrastructure for extreme-scale operational data collection, known as the Operations Monitoring and Notification Infrastructure (OMNI), at the National Energy Research Scientific Computing (NERSC) center at Lawrence Berkeley National Laboratory. OMNI currently holds over 522 billion records of online operational data (totaling over 125 TB) and can ingest new data at an average rate of 25,000 data points per second. Using OMNI as a central repository, facilities and environmental data can be seamlessly integrated and correlated with machine metrics, job scheduler information, network errors, and more, providing a holistic view of data center operations. To demonstrate the value of real-time operational data collection, we present a number of real-world case studies in which having OMNI data readily available led to key operational insights at NERSC. The case results include a reduction in the downtime of an HPC system during a facility transition, as well as $2.5 million in electrical substation savings for the next-generation Perlmutter HPC system.
Pages: 9