Application health monitoring for extreme-scale resiliency using cooperative fault management

Cited by: 5

Authors:

Agarwal, Pratul K. ^{[1,2]}

Naughton, Thomas ^{[1]}

Park, Byung H. ^{[1]}

Bernholdt, David E. ^{[1]}

Hursey, Joshua J. ^{[1,3]}

Geist, Al ^{[1]}

Affiliations:

[1] Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN USA

[2] Univ Tennessee, Dept Biochem & Cellular & Mol Biol, Knoxville, TN 37996 USA

[3] IBM Corp, IBM Syst, Rochester, MN USA

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2020 / Vol. 32 / Issue 02

Funding:

US National Institutes of Health (NIH);

Keywords:

exascale resiliency; fault tolerance; heterogeneous systems; molecular dynamics; quantum chemistry calculations; silent errors; TOLERANCE; ALGORITHMS;

DOI:

10.1002/cpe.5449

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns, as indicators of an application's health and knowledge that violation of these patterns could be indication of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics, and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.

Pages: 13

73 references in total

