TeaMPI - Replication-Based Resilience Without the (Performance) Pain

Cited by: 2
Authors
Samfass, Philipp [1]
Weinzierl, Tobias [2]
Hazelwood, Benjamin [2]
Bader, Michael [1]
Affiliations
[1] Tech Univ Munich, D-85748 Garching, Germany
[2] Univ Durham, Inst Data Sci, Comp Sci, Durham DH1 3LE, England
Source
HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2020 | 2020, Vol. 12151
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
FAULT-TOLERANCE; DESIGN;
DOI
10.1007/978-3-030-50743-5_23
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet replication is often not employed at larger scales, as naively mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine, a task-based solver for hyperbolic equation systems, that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
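The task-sharing idea in the abstract can be illustrated with a toy model. The sketch below is not teaMPI or ExaHyPE code: the function name `run_replicas`, the shared dictionary, and the lockstep rounds are all assumptions made for illustration. It assumes outcomes become visible to peers instantly, which is an idealisation; in a real run, messages arrive with delay and some tasks would still be computed redundantly.

```python
import random


def run_replicas(num_tasks, num_replicas=2, seed=0):
    """Toy lockstep simulation of task-outcome sharing between replicas.

    Hypothetical illustration only: every replica owns the same task set,
    processes it in its own shuffled order, and skips any task whose
    outcome a peer has already published.
    """
    rng = random.Random(seed)

    # Each replica gets its own random execution order of the same tasks.
    orders = []
    for _ in range(num_replicas):
        order = list(range(num_tasks))
        rng.shuffle(order)
        orders.append(order)

    shared = {}                     # task id -> outcome, visible to all replicas
    computed = [0] * num_replicas   # tasks a replica computed itself
    skipped = [0] * num_replicas    # tasks a replica took over from a peer
    positions = [0] * num_replicas  # next index in each replica's order

    while any(p < num_tasks for p in positions):
        for r in range(num_replicas):
            # Skip tasks whose outcome has already been shared by a peer.
            while (positions[r] < num_tasks
                   and orders[r][positions[r]] in shared):
                skipped[r] += 1
                positions[r] += 1
            # Compute at most one remaining task per round and publish it.
            if positions[r] < num_tasks:
                task = orders[r][positions[r]]
                shared[task] = task * task  # stand-in for the real task body
                computed[r] += 1
                positions[r] += 1

    return computed, skipped


if __name__ == "__main__":
    computed, skipped = run_replicas(num_tasks=20)
    print("computed per replica:", computed)
    print("skipped per replica: ", skipped)
```

In this idealised setting every task is computed exactly once across all replicas, yet each replica ends up holding every outcome, which is the sense in which the redundant cycles are not wasted. Shuffling matters: if all replicas used the same order, they would compute the same task at the same time and nothing could be skipped.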
Pages: 455-473
Page count: 19