Analytical performance estimation during code generation on modern GPUs

Cited by: 5

Authors:

Ernst, Dominik [1]

Holzer, Markus [2]

Hager, Georg [1]

Knorr, Matthias [1]

Wellein, Gerhard [1]

Affiliations:

[1] Friedrich Alexander Univ Erlangen Nurnberg, Erlangen Natl High Performance Comp Ctr NHRFAU, Martensstr 1, D-91058 Erlangen, Germany

[2] Friedrich Alexander Univ Erlangen Nurnberg, Erlangen, Germany

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2023, Volume 173

Keywords:

GPU; Analytical performance modeling; GPU performance model; Stencil codes; Layer condition;

DOI:

10.1016/j.jpdc.2022.11.003

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Discipline classification code:

081202;

Abstract:

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This paper identifies the relevant performance-defining mechanisms for memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient code candidates with high accuracy. We examine the changes of the A100 GPU architecture compared to the predecessor V100 and address the challenges of how to model the data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the "pystencils" stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels but can be integrated into any code generator that can generate the required address expressions.


Pages: 152-167

Page count: 16

