SaNSA - the Supercomputer and Node State Architecture

Cited by: 0
Authors
Agarwal, Neil [1 ,2 ]
Greenberg, Hugh [1 ]
Blanchard, Sean [1 ]
DeBardeleben, Nathan [1 ]
Affiliations
[1] Los Alamos Natl Lab, Ultrascale Syst Res Ctr, High Performance Comp, Los Alamos, NM 87545 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
PROCEEDINGS OF FTXS 2018: IEEE/ACM 8TH WORKSHOP ON FAULT TOLERANCE FOR HPC AT EXTREME SCALE (FTXS) | 2018
Keywords
system state; node state; health monitoring; anomaly detection; software architecture;
DOI
10.1109/FTXS.2018.00011
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In this work we present SaNSA, the Supercomputer and Node State Architecture, a software infrastructure for historical analysis and anomaly detection. SaNSA consumes data from multiple sources including system logs, the resource manager, scheduler, and job logs. Furthermore, additional context such as scheduled maintenance events or dedicated application run times for specific science teams can be overlaid. We discuss how this contextual information allows for more nuanced analysis. SaNSA allows the user to apply arbitrary attributes, for instance, positional information describing where nodes are located in a data center. We show how this information allows us to identify anomalous behavior in one rack of a 1,500-node cluster. We explain the design of SaNSA and then test it on four open compute clusters at LANL. We ingest over 1.1 billion lines of system logs in our study of 190 days in 2018. Using SaNSA, we perform a number of different anomaly detection methods and explain their findings in the context of a production supercomputing data center. For example, we report on instances of misconfigured nodes which receive no scheduled jobs for a period of time, as well as examples of correlated rack failures which cause jobs to crash.
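The abstract's core idea of per-node state timelines decorated with arbitrary attributes (such as rack position) can be sketched roughly as follows. All names, data structures, and the rack-grouping heuristic here are illustrative assumptions for exposition, not SaNSA's actual API:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class NodeRecord:
    """Hypothetical per-node record: state intervals plus arbitrary attributes."""
    states: list = field(default_factory=list)   # (start, end, state) tuples
    attrs: dict = field(default_factory=dict)    # e.g. {"rack": "r12"}

def rack_down_fractions(nodes, t):
    """Fraction of each rack's nodes that are in state 'down' at time t."""
    down, total = defaultdict(int), defaultdict(int)
    for name, rec in nodes.items():
        rack = rec.attrs.get("rack", "unknown")
        total[rack] += 1
        for start, end, state in rec.states:
            if start <= t < end and state == "down":
                down[rack] += 1
                break
    return {rack: down[rack] / total[rack] for rack in total}

# Toy data: both nodes in rack r1 are down at t=5; r2 is healthy.
nodes = {
    "n01": NodeRecord([(0, 10, "down")], {"rack": "r1"}),
    "n02": NodeRecord([(0, 10, "down")], {"rack": "r1"}),
    "n03": NodeRecord([(0, 10, "up")],   {"rack": "r2"}),
}
print(rack_down_fractions(nodes, 5))   # {'r1': 1.0, 'r2': 0.0}
```

Grouping node states by a positional attribute in this way is one plausible route to the correlated rack-level anomalies the paper reports.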
Pages: 69 - 78
Page count: 10