Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Cited by: 95

Authors:

Gupta, Saurabh [1]

Patel, Tirthak [2]

Engelmann, Christian [3]

Tiwari, Devesh [2]

Affiliations:

[1] Intel Labs, Santa Clara, CA 95052 USA

[2] Northeastern Univ, Boston, MA 02115 USA

[3] Oak Ridge Natl Lab, Oak Ridge, TN USA

Source:

SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2017

DOI:

10.1145/3126908.3126937

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Subject classification code:

081202

Abstract:

Resilience is one of the key challenges in maintaining high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of eight years. We confirm previous findings that continue to be valid, discover new findings, and discuss their implications.

Pages: 12
