Exploring Reliability of Exascale Systems through Simulations

Cited by: 0

Authors:

Zhao, Dongfang [1]

Zhang, Da [1]

Wang, Ke [1]

Raicu, Ioan [1,2]

Affiliations:

[1] IIT, Dept Comp Sci, Chicago, IL 60616 USA

[2] Argonne Natl Lab, Div Math & Comp Sci, Argonne, IL 60439 USA

Source:

HIGH PERFORMANCE COMPUTING SYMPOSIUM 2013 (HPC 2013) - 2013 SPRING SIMULATION MULTI-CONFERENCE (SPRINGSIM'13) | 2013 / Vol. 45 / No. 06

Funding:

U.S. National Science Foundation;

Keywords:

Exascale Computing; Checkpointing; Fault Tolerance; Parallel Filesystems; Distributed Filesystems;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to maintain system reliability effectively and efficiently. Checkpointing is the state-of-the-art technique for high-end computing system reliability and has proved to work well at current petascale scales. This paper investigates the suitability of the checkpointing mechanism for exascale computers, across both parallel filesystems and distributed filesystems. We built a model to emulate exascale systems, and developed a simulator, RXSim, to study their reliability and efficiency. Experiments show that overall system efficiency and availability would go towards zero as system scales approach exascale when checkpointing to parallel filesystems. However, the simulations suggest that a distributed filesystem with local persistent storage would offer excellent scalability and aggregate bandwidth, enabling efficient checkpointing at exascale.
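The scaling argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. This is not the RXSim model, which this record does not describe. It uses Young's well-known first-order approximation for the optimal checkpoint interval, and every number below (node MTBF, state size per node, bandwidths) is a hypothetical value chosen only for illustration. The assumption is that a shared parallel filesystem offers a fixed aggregate bandwidth, so checkpoint time grows with node count, while node-local storage keeps checkpoint time constant.

```python
import math

def checkpoint_efficiency(nodes, node_mtbf_s, ckpt_time_s):
    """Estimate the fraction of time spent on useful work.

    With Young's optimal interval tau = sqrt(2 * delta * M), the combined
    checkpoint and rework overhead is roughly sqrt(2 * delta / M), where
    delta is the checkpoint time and M is the system MTBF. The estimate is
    only meaningful when delta is much smaller than M; it is clamped at 0.
    """
    system_mtbf_s = node_mtbf_s / nodes  # assumes independent node failures
    overhead = math.sqrt(2.0 * ckpt_time_s / system_mtbf_s)
    return max(0.0, 1.0 - overhead)

def shared_fs_ckpt_time(nodes, state_gb, shared_bw_gb_s):
    # All nodes write through one fixed aggregate bandwidth.
    return nodes * state_gb / shared_bw_gb_s

def local_fs_ckpt_time(state_gb, local_bw_gb_s):
    # Each node writes to its own local device, independent of scale.
    return state_gb / local_bw_gb_s

# Hypothetical parameters, not taken from the paper.
NODE_MTBF_S = 5 * 365 * 24 * 3600   # 5 years per node
STATE_GB = 16.0                     # checkpoint state per node
SHARED_BW_GB_S = 1000.0             # aggregate parallel filesystem bandwidth
LOCAL_BW_GB_S = 1.0                 # per-node local storage bandwidth

for n in (1_000, 10_000, 100_000, 1_000_000):
    shared = checkpoint_efficiency(
        n, NODE_MTBF_S, shared_fs_ckpt_time(n, STATE_GB, SHARED_BW_GB_S))
    local = checkpoint_efficiency(
        n, NODE_MTBF_S, local_fs_ckpt_time(STATE_GB, LOCAL_BW_GB_S))
    print(f"{n:>9} nodes  shared={shared:.3f}  local={local:.3f}")
```

Under these assumed parameters, the shared-filesystem estimate falls as the node count grows, while the local-storage estimate degrades more slowly because only the system MTBF shrinks with scale.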


Pages: 1 / 9

Page count: 9
