Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Cited by: 2
Authors
Lim, Seung-Hwan [1]
Miller, Ross G. [1]
Vazhkudai, Sudharshan S. [1]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
Source
2020 IEEE 34th International Parallel and Distributed Processing Symposium (IPDPS 2020) | 2020
DOI
10.1109/IPDPS47924.2020.00028
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year period. Through careful joining of two system logs, the Machine Check Architecture (MCA) log and the job scheduler log, we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.
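The log join described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layouts (`Job`, `HwError`), the field names, and the matching rule (an error belongs to a job if it occurred on one of the job's nodes during the job's run time) are all assumptions made for illustration, and the sample records are invented.

```python
from collections import Counter
from typing import NamedTuple


class Job(NamedTuple):
    """Hypothetical scheduler-log record (field names are assumptions)."""
    job_id: str
    user: str
    nodes: frozenset  # IDs of the compute nodes allocated to the job
    start: int        # start time in epoch seconds, inclusive
    end: int          # end time in epoch seconds, exclusive


class HwError(NamedTuple):
    """Hypothetical MCA-log record (field names are assumptions)."""
    node: int
    timestamp: int    # epoch seconds
    kind: str         # e.g. "memory"


def errors_per_job(jobs, errors):
    """Count the hardware errors that fall on a job's nodes while it runs."""
    counts = Counter()
    for err in errors:
        for job in jobs:
            on_job_node = err.node in job.nodes
            during_job = job.start <= err.timestamp < job.end
            if on_job_node and during_job:
                counts[job.job_id] += 1
    return counts


# Invented sample data, used only to exercise the join.
jobs = [
    Job("j1", "alice", frozenset({1, 2}), 100, 200),
    Job("j2", "bob", frozenset({3}), 150, 250),
]
errors = [
    HwError(1, 120, "memory"),  # node 1 during j1
    HwError(2, 199, "memory"),  # node 2 during j1
    HwError(3, 100, "memory"),  # node 3, before j2 starts
    HwError(3, 200, "cpu"),     # node 3 during j2
    HwError(1, 200, "memory"),  # node 1, at j1's exclusive end time
]
per_job = errors_per_job(jobs, errors)
```

The nested loop is quadratic and only suitable for small inputs; at the scale reported in the abstract, a real join would presumably index jobs by node and time interval first.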
Pages: 180-190
Page count: 11