A Large-Scale Study of Failures in High-Performance Computing Systems

Cited by: 378
Authors
Schroeder, Bianca [1]
Gibson, Garth A. [2 ]
Affiliations
[1] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 2E4, Canada
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
Keywords
Large-scale systems; high-performance computing; supercomputing; reliability; failures; node outages; field study; empirical study; repair time; time between failures; root cause
DOI
10.1109/TDSC.2009.4
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
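As a reading aid for the abstract's two distributional claims, the following is a minimal Python sketch, not the authors' analysis code. It evaluates the standard Weibull hazard function, which decreases over time when the shape parameter is below 1, and the standard mean and median of a lognormal distribution. The function names and all parameter values (shape 0.7, scale 100, mu 0, sigma 1) are illustrative assumptions and are not taken from the paper or its data.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution with log-scale parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu):
    """Median of a lognormal distribution with log-scale location mu."""
    return math.exp(mu)


# Hypothetical parameters chosen only for illustration (not from the paper).
SHAPE, SCALE = 0.7, 100.0
TIMES = (1.0, 10.0, 100.0, 1000.0)

# With shape < 1 the hazard shrinks as the time since the last failure grows.
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in TIMES]

# A lognormal is right-skewed, so its mean exceeds its median.
mean_repair = lognormal_mean(0.0, 1.0)
median_repair = lognormal_median(0.0)
```

Under these assumed parameters, `hazards` is strictly decreasing, which is what "decreasing hazard rate" means: the longer a system has run since its last failure, the lower its instantaneous failure rate. The gap between `mean_repair` and `median_repair` shows why a mean repair time alone can mislead when repair times are lognormal.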
Pages: 337-350
Number of pages: 14