Towards resilient and energy efficient scalable Krylov solvers

Cited by: 1
Authors
Miao, Zheng [1 ]
Calhoun, Jon C. [2 ]
Ge, Rong [2 ]
Affiliations
[1] Hangzhou Dianzi University, Hangzhou, People's Republic of China
[2] Clemson University, Clemson, SC, USA
Funding
National Science Foundation (US)
Keywords
Resilience; Energy efficiency; Linear solver; Forward recovery; HPC; Checkpoint-restart; Scalability; Fault tolerance; Performance
DOI
10.1016/j.parco.2024.103122
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Exascale computing must simultaneously address both energy efficiency and resilience, as power limits constrain scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically degrades the other. To deliver the promised performance within a given power budget, exascale computing demands a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques, including checkpoint-restart and forward recovery. We focus on sparse linear solvers, as they are fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and we develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into forward recovery, which has recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that although the time and energy costs of the various resilience techniques differ, they share common components and can be quantitatively evaluated within a generalized framework. This analysis framework can guide the design of performance and energy optimization techniques. While each resilience technique has advantages depending on the fault rate, system size, and power budget, forward recovery can further benefit from matrix-aware optimizations for large-scale computing.
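As a rough illustration of the kind of time/energy cost analysis the abstract describes, the sketch below estimates the expected runtime and energy of a checkpoint-restart run, assuming Young's approximation for the checkpoint interval and a constant average system power. The function name, parameters, and example numbers are hypothetical and are not taken from the paper's generalized framework.

"""
Illustrative sketch only (not the paper's actual cost framework):
a first-order time/energy model for checkpoint-restart, assuming
Young's approximation for the optimal checkpoint interval and a
constant average system power. All names and values are hypothetical.
"""
import math


def checkpoint_restart_cost(work_time, ckpt_cost, restart_cost,
                            mtbf, avg_power_watts):
    """Estimate expected time (s) and energy (J) with checkpoint-restart.

    work_time       -- failure-free solve time in seconds
    ckpt_cost       -- time to write one checkpoint (C)
    restart_cost    -- time to restart after a failure (R)
    mtbf            -- system mean time between failures (M)
    avg_power_watts -- assumed constant average system power
    """
    # Young's approximation for the optimal checkpoint interval.
    interval = math.sqrt(2.0 * ckpt_cost * mtbf)

    # Checkpoint overhead: roughly one checkpoint per interval of useful work.
    ckpt_overhead = work_time * (ckpt_cost / interval)

    # Expected failures over the extended run; on average half an interval
    # of work is lost per failure, plus the restart time.
    base_time = work_time + ckpt_overhead
    expected_failures = base_time / mtbf
    rework = expected_failures * (interval / 2.0 + restart_cost)

    total_time = base_time + rework
    total_energy = avg_power_watts * total_time  # E = P_avg * T
    return total_time, total_energy


if __name__ == "__main__":
    t, e = checkpoint_restart_cost(work_time=3600, ckpt_cost=30,
                                   restart_cost=60, mtbf=4 * 3600,
                                   avg_power_watts=2000)
    print(f"expected time: {t:.0f} s, expected energy: {e / 1e6:.1f} MJ")

A forward-recovery scheme would replace the checkpoint and rework terms with the cost of reconstructing lost state from surviving data, which is why the two approaches compare differently as the fault rate and system size change.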
Pages: 12