Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Cited by: 2
Authors
Lim, Seung-Hwan [1]
Miller, Ross G. [1]
Vazhkudai, Sudharshan S. [1]
Affiliation
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
Source
2020 IEEE 34th International Parallel and Distributed Processing Symposium (IPDPS 2020) | 2020
DOI
10.1109/IPDPS47924.2020.00028
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors while an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from the 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year period. By carefully joining two system logs, the Machine Check Architecture (MCA) log and the job scheduler log, we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.
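The log join described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layouts (`Job`, `HardwareError`), the field names, and the sample values are all hypothetical, and the abstract does not describe the actual MCA or scheduler log formats. The sketch only shows the general idea of attributing each error to the job that held the affected node at the time of the error, then counting errors per job and per user.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical scheduler-log record; field names are assumptions.
    job_id: str
    user: str
    nodes: frozenset  # node ids allocated to the job
    start: int        # epoch seconds, inclusive
    end: int          # epoch seconds, exclusive


@dataclass(frozen=True)
class HardwareError:
    # Hypothetical MCA-log record; field names are assumptions.
    node: int
    timestamp: int    # epoch seconds
    kind: str         # e.g. "memory"


def attribute_errors(errors, jobs):
    """Map each error to the job that held its node at that time, if any."""
    # Index jobs by node so each error only scans jobs that used its node.
    jobs_by_node = {}
    for job in jobs:
        for node in job.nodes:
            jobs_by_node.setdefault(node, []).append(job)

    per_job = Counter()
    per_user = Counter()
    unattributed = 0
    for err in errors:
        owner = next(
            (j for j in jobs_by_node.get(err.node, [])
             if j.start <= err.timestamp < j.end),
            None,
        )
        if owner is None:
            unattributed += 1  # node was idle or outside any logged job
            continue
        per_job[owner.job_id] += 1
        per_user[owner.user] += 1
    return per_job, per_user, unattributed


# Made-up sample data, for illustration only.
jobs = [
    Job("j1", "alice", frozenset({0, 1}), 100, 200),
    Job("j2", "bob", frozenset({1, 2}), 200, 300),
]
errors = [
    HardwareError(0, 150, "memory"),
    HardwareError(1, 199, "memory"),
    HardwareError(1, 200, "memory"),
    HardwareError(2, 50, "memory"),
]
per_job, per_user, unattributed = attribute_errors(errors, jobs)
```

The half-open interval `[start, end)` is a design choice in this sketch so that an error logged at the exact handover time between two back-to-back jobs on the same node is attributed to only one of them.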
Pages: 180-190
Page count: 11