Energy Analysis and Optimization for Resilient Scalable Linear Systems

Cited by: 5
Authors
Miao, Zheng [1]
Calhoun, Jon [1]
Ge, Rong [1]
Affiliations
[1] Clemson Univ, Clemson, SC 29631 USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2018
Funding
National Science Foundation (USA);
Keywords
Resilience; Energy-Efficiency; Forward-Recovery; HPC; FAULT-TOLERANCE; PERFORMANCE;
DOI
10.1109/CLUSTER.2018.00015
Chinese Library Classification
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize costs of resilience techniques including checkpoint-restart and forward recovery for large sparse linear system solvers. In particular, we present experimental and analytical methods to analyze and quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and optimization techniques significantly reduce such overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget.
Pages: 24-34
Page count: 11