Towards resilient and energy efficient scalable Krylov solvers

Cited by: 1

Authors:

Miao, Zheng [1]

Calhoun, Jon C. [2]

Ge, Rong [2]

Affiliations:

[1] Hangzhou Dianzi Univ, Hangzhou, Peoples R China

[2] Clemson Univ, Clemson, SC USA

Source:

PARALLEL COMPUTING | 2025 / Vol. 123

Funding:

National Science Foundation (USA)

Keywords:

Resilience; Energy-efficiency; Linear solver; Forward-recovery; HPC; Checkpoint-restart; Scalability; Fault-tolerance; Performance

DOI:

10.1016/j.parco.2024.103122

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Exascale computing must address energy efficiency and resilience simultaneously, as power limits impact scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically harms the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques, including checkpoint-restart and forward recovery. We focus on sparse linear solvers because they are fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into forward recovery, which has recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques differ, they share common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, forward recovery can further benefit from matrix-aware optimizations for large-scale computing.

Pages: 12

References (55 in total; entries 1-10 shown):

[1] Numerical recovery strategies for parallel resilient Krylov linear solvers [J].

Agullo, Emmanuel ;

Giraud, Luc ;

Guermouche, Abdou ;

Roman, Jean ;

Zounon, Mawussi .

NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS, 2016, 23 (05) :888-905

[2] On the resilience of parallel sparse hybrid solvers [J].

Agullo, Emmanuel ;

Giraud, Luc ;

Zounon, Mawussi .

2015 IEEE 22ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2015, :75-84

[3]

Aitken Rob, 2015, 2015 IEEE 33rd VLSI Test Symposium (VTS). Proceedings, P1, DOI 10.1109/VTS.2015.7116281

[4]

Amdahl G., 1967, P APR 18 20 1967 SPR, P483, DOI 10.1145/1465482.1465560

[5]

Ashby S., 2010, OPPORTUNITIES CHALLE, P1

[6] Optimal Checkpointing Period: Time vs. Energy [J].

Aupy, Guillaume ;

Benoit, Anne ;

Herault, Thomas ;

Robert, Yves ;

Dongarra, Jack .

HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING AND SIMULATION, 2014, 8551 :203-214

[7]

Bienz A., 2017, RAPtor: parallel algebraic multigrid v0.1

[8] Node aware sparse matrix-vector multiplication [J].

Bienz, Amanda ;

Gropp, William D. ;

Olson, Luke N. .

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019, 130 :166-178

[9]

Bjorck A., 1990, HDB NUMERICAL ANAL, V1, P465, DOI 10.1016/S1570-8659(05)80036-5

[10]

Briggs, 2000, MULTIGRID TUTORIAL
