Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Cited by: 12

Authors:

Di, Sheng [1]

Guo, Hanqi [1]

Gupta, Rinku [1]

Pershey, Eric R. [1]

Snir, Marc [2]

Cappello, Franck [1]

Affiliations:

[1] Argonne Natl Lab, MCS, Argonne, IL 60439 USA

[2] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2019, Vol. 30, No. 02

Keywords:

Peta-scale supercomputer; mining correlations; fatal event analysis; reliability-availability-serviceability (RAS); failures

DOI:

10.1109/TPDS.2018.2864184

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

In this paper, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method to effectively extract fatal events from masses of system messages that are heavily duplicated in the log. LogAider detects temporal correlation very precisely, with a high similarity (up to 95 percent) to the ground truth (i.e., the failure records reported by the administrators). The total number of fatal events is reduced to about 1,255, down from an original 2.6 million duplicated fatal messages. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong. We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.
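
As an illustration of the two quantities the abstract relies on (collapsing duplicated fatal messages into events, then measuring the mean time between those events), the following is a minimal sketch. It is not LogAider's filtering method, which the abstract does not describe in detail. It assumes a simple rule in which a message belongs to the current event if it arrives within a fixed time window of the previous message. The function names, the 20-minute window, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def coalesce_events(timestamps, window=timedelta(minutes=20)):
    """Group message timestamps into events (hypothetical time-window rule).

    A message joins the current event when it arrives within `window`
    of the previous message; otherwise it starts a new event.
    """
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][-1] <= window:
            events[-1].append(ts)
        else:
            events.append([ts])
    return events


def mean_time_between_events(events):
    """Average gap between the start times of consecutive events."""
    starts = [event[0] for event in events]
    if len(starts) < 2:
        return None  # undefined with fewer than two events
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample data: six fatal messages forming three bursts.
t0 = datetime(2015, 1, 1)
messages = [
    t0,
    t0 + timedelta(minutes=5),
    t0 + timedelta(minutes=8),
    t0 + timedelta(days=2),
    t0 + timedelta(days=2, minutes=1),
    t0 + timedelta(days=5),
]

events = coalesce_events(messages)
mtbfe = mean_time_between_events(events)
```

With this sample data the six messages collapse into three events, and the mean gap between event start times is 2.5 days. The window size controls how aggressively duplicates are merged, so any real analysis would need to justify that choice against administrator-reported failure records.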

Citation

Pages: 361-374

Page count: 14
