Resilience for Stencil Computations with Latent Errors

Cited by: 5
Authors
Fang, Aiman [1]
Cavelan, Aurelien [2, 3]
Robert, Yves [2, 3, 4]
Chien, Andrew A. [1, 5]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] Ecole Normale Super Lyon, Lyon, France
[3] INRIA, Rocquencourt, France
[4] Univ Tennessee, Knoxville, TN 37996 USA
[5] Argonne Natl Lab, Argonne, IL 60439 USA
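Source
2017 46th International Conference on Parallel Processing (ICPP) | 2017
Keywords
DOI
10.1109/ICPP.2017.67
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract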
Projections and measurements of error rates in near-exascale and exascale systems suggest dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such growth makes resilience a critical concern, and may increase the incidence of errors that "escape", silently corrupting application state. Such errors can often be revealed by application software tests, but only with long latencies, and are thus known as latent errors. We explore how to recover efficiently from latent errors with an approach called application-based focused recovery (ABFR). Specifically, we present a case study of stencil computations, a widely useful computational structure, showing how ABFR focuses recovery effort where needed, uses intelligent testing and pruning to reduce recovery effort, and enables recovery to be overlapped with application computation. We analyze and characterize the ABFR approach on stencils, creating a performance model parameterized by error rate and detection interval (latency). We compare projections from the model to experimental results with the Chombo stencil application, validating the model and showing that ABFR on stencils can achieve significant reductions in error recovery cost (up to 400x) and recovery latency (up to 4x). Such reductions enable efficient execution at scale with high latent error rates.
Pages: 581-590
Page count: 10