Energy profile of rollback-recovery strategies in high performance computing

Cited by: 18
Authors
Meneses, Esteban [1]
Sarood, Osman [1]
Kale, Laxmikant V. [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Parallel Programming Lab, Champaign, IL 61820 USA
Keywords
Rollback-recovery; Checkpoint/restart; Message logging; Parallel recovery; Energy consumption
DOI
10.1016/j.parco.2014.03.005
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of current machines. Second, those components have to fit within the power envelope of the installation and keep the energy consumption within operational margins. Extreme-scale machines will have to incorporate fault tolerance mechanisms and honor the energy and power restrictions. Therefore, it is essential to understand how fault tolerance and energy consumption interplay. This paper presents a comparative evaluation and analysis of energy consumption of three different rollback-recovery protocols: checkpoint/restart, message logging and parallel recovery. Our experimental evaluation shows parallel recovery has the minimum execution time and energy consumption. Additionally, we present an analytical model that projects parallel recovery can reduce energy consumption more than 37% compared to checkpoint/restart at extreme scale. (C) 2014 Elsevier B.V. All rights reserved.
Pages: 536-547
Page count: 12