Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

Cited by: 100

Authors:

Gupta, Saurabh [2]

Tiwari, Devesh [2]

Jantzi, Christopher [1]

Rogers, James [2]

Maxwell, Don [2]

Affiliations:

[1] Univ Washington, Seattle, WA 98195 USA

[2] Oak Ridge Natl Lab, Oak Ridge Leadership Comp Facil, Oak Ridge, TN 37831 USA

Source:

2015 45TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS | 2015

DOI:

10.1109/DSN.2015.52

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While the temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between the spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures exhibit "spatial locality" at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance in a dynamic, production-level HPC system.

Pages: 37-44

Page count: 8
