A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Cited by: 8
Authors
Park, Byung H. [1 ]
Hui, Yawei [1 ]
Boehm, Swen [1 ]
Ashraf, Rizwan A. [1 ]
Layton, Christopher [2 ]
Engelmann, Christian [1 ]
Affiliations
[1] Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN 37830 USA
[2] Oak Ridge Natl Lab, Natl Ctr Computat Sci, Oak Ridge, TN USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2018
Keywords
high performance computing; Big Data applications; data analysis; event log analysis
DOI
10.1109/CLUSTER.2018.00073
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Reliability, availability, and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated by multiple logging systems and sensors that cover many components of the system. Analyzing these data for persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of the log data produced by each subsystem make it harder to identify implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail its workflow using three years of log data collected from ORNL's Titan supercomputer.
Pages: 571-579
Number of pages: 9