Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Cited by: 90
Authors
Gupta, Saurabh [1]
Patel, Tirthak [2]
Engelmann, Christian [3]
Tiwari, Devesh [2]
Affiliations
[1] Intel Labs, Santa Clara, CA 95052 USA
[2] Northeastern Univ, Boston, MA 02115 USA
[3] Oak Ridge Natl Lab, Oak Ridge, TN USA
Source
SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2017
DOI
10.1145/3126908.3126937
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.
Pages: 12
Related papers
40 in total
  • [1] [Anonymous], INT C DEP SYST NETW
  • [2] [Anonymous], 2012, PROC IEEE INT C HIGH
  • [3] Bairavasundaram L.N., 2008, Characteristics, Impact, and Tolerance of Partial Disk Failures
  • [4] Reducing Waste in Extreme Scale Systems through Introspective Analysis
    Bautista-Gomez, Leonardo
    Gainaru, Ana
    Perarnau, Swann
    Tiwari, Devesh
    Gupta, Saurabh
    Engelmann, Christian
    Cappello, Franck
    Snir, Marc
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016, : 212 - 221
  • [5] TOWARD EXASCALE RESILIENCE
    Cappello, Franck
    Geist, Al
    Gropp, Bill
    Kale, Laxmikant
    Kramer, Bill
    Snir, Marc
    [J]. INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2009, 23 (04) : 374 - 388
  • [6] A higher order estimate of the optimum checkpoint interval for restart dumps
    Daly, JT
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING THEORY METHODS AND APPLICATIONS, 2006, 22 (03) : 303 - 312
  • [7] Di Martino, 2014, 44 INT C DEP SYST NE
  • [8] Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs
    Di Martino, Catello
    Kalbarczyk, Zbigniew
    Kramer, William
    Iyer, Ravishankar
    [J]. 2015 45TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, 2015, : 25 - 36
  • [9] LOGAIDER: A tool for mining potential correlations of HPC log events
    Di, Sheng
    Gupta, Rinku
    Snir, Marc
    Pershey, Eric
    Cappello, Franck
    [J]. 2017 17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2017, : 442 - 451
  • [10] El-Sayed Nosayba, 2013, READING LINES FAILUR