Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Authors
Di S. [1]
Guo H. [1]
Gupta R. [1]
Pershey E.R. [1]
Snir M. [2]
Cappello F. [1]
Affiliations
[1] Mathematics and Computer Science (MCS), Argonne National Laboratory, Lemont, 60439, IL
[2] Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, 61820, IL
Source
IEEE Transactions on Parallel and Distributed Systems | 2019 / Vol. 30 / No. 02
Keywords
fatal event analysis; mining correlations; Peta-scale supercomputer; reliability-availability-serviceability (RAS)
DOI
10.1109/tpds.2018.2864184
Abstract
In this paper, we explore potential correlations of fatal system events on one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, LogAider, with a novel filtering method that extracts fatal events from the masses of heavily duplicated system messages in the log. LogAider detects temporal correlations precisely, with a similarity of up to 95 percent to the ground truth (i.e., the failure records reported by the administrators). Filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important 'takeaways' that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events generally follows a Pareto-like principle. The temporal correlation among fatal events is much stronger than that among warn messages and info messages, and the correlated events tend to form a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the MTTI is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., among fatal events inside racks) is relatively strong.
We believe our work will be of interest to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate them accordingly. © 2018 IEEE.
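As a rough illustration of the two quantities the abstract reports (a count of distinct fatal events after filtering duplicated messages, and a mean time between fatal events), the following is a minimal sketch. It is not LogAider's filtering method, which the abstract does not describe; the gap-based rule, the 30-minute `window`, and the function names `collapse_events` and `mean_time_between` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def collapse_events(timestamps, window=timedelta(minutes=30)):
    """Collapse bursts of duplicated messages into single events.

    Assumption (not from the paper): a message belongs to the same event
    as the previous message if the gap between them is at most `window`.
    Returns the timestamp of the first message of each event.
    """
    events = []
    prev = None
    for ts in sorted(timestamps):
        if prev is None or ts - prev > window:
            events.append(ts)  # gap exceeded: a new event starts here
        prev = ts
    return events


def mean_time_between(events):
    """Mean interval between consecutive event start times, or None."""
    if len(events) < 2:
        return None
    return (events[-1] - events[0]) / (len(events) - 1)


if __name__ == "__main__":
    base = datetime(2015, 1, 1)
    # Hypothetical message times: three bursts of duplicated messages.
    messages = [
        base,
        base + timedelta(minutes=5),
        base + timedelta(minutes=10),
        base + timedelta(days=2),
        base + timedelta(days=2, minutes=1),
        base + timedelta(days=5),
    ]
    events = collapse_events(messages)
    print(len(events), mean_time_between(events))
```

With these made-up timestamps, the six messages collapse into three events, and the mean interval is the span from the first to the last event divided by the two gaps between them. A real RAS log would also need attribute-aware filtering (for example by message ID or location), which this sketch leaves out.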
Pages: 361-374
Page count: 13