FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Cited by: 0
Authors
Wei Hu
Guang-Ming Liu
Yan-Huang Jiang
Affiliations
[1] College of Computer, National University of Defense Technology
[2] National Supercomputer Center in Tianjin
Source
Frontiers of Information Technology & Electronic Engineering | 2018 / Vol. 19
Keywords
High-performance computing; Proactive fault tolerance; Failure locality; Process replication; Process prefetching
DOI
Not available
CLC number
TP338.6
Abstract
As the scale of supercomputers grows rapidly, reliability problems increasingly dominate system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot solve this problem effectively. To address it, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the ‘work-most’ (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. We observe, for the first time, a failure locality phenomenon in supercomputers that is similar to program locality. Based on this failure locality, the new proactive mechanism combines process replication with process prefetching, and it significantly reduces the losses caused by failures regardless of whether they were predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms, with up to 10% improvement in application efficiency at common failure prediction accuracies, and that it is effective for petascale systems and beyond.
Pages: 1273-1290
Page count: 17