LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU

Cited by: 4

Authors:

Levchenko, Vadim [1]

Zakirov, Andrey [1]

Perepelkina, Anastasia [1,2]

Affiliations:

[1] Keldysh Inst Appl Math, Moscow, Russia

[2] Kintech Lab Ltd, Moscow, Russia

Source:

PARALLEL COMPUTATIONAL TECHNOLOGIES, PCT 2019 | 2019 / Vol. 1063

Funding:

Russian Science Foundation

Keywords:

LRnLA; LBM; Temporal blocking; Time skewing; GPU; Vectorization; ALGORITHM;

DOI:

10.1007/978-3-030-28163-2_10

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an implementation of the Lattice Boltzmann Method (LBM) with Locally Recursive non-Locally Asynchronous (LRnLA) algorithms on GPU and CPU. The algorithm is based on the recursive subdivision of the domain of the dD1T space-time simulation and loosens the memory-bound limit for numerical schemes with local dependencies. We show that the LRnLA algorithm makes it possible to overcome the main memory bandwidth limitations in both CPU and GPU implementations. For CPU, we find the data layout that provides alignment for the full use of AVX2/AVX512 vectorization. For GPU, we devise a procedure for pairwise CUDA-block synchronization and apply it to the implementation of the LRnLA algorithm, which previously worked only on CPU. Performance on GPU is higher, as is usual in modern implementations; however, the performance gap in our implementation is smaller, thanks to a more efficient CPU version. Through a detailed comparison, we show possible future applications for both the CPU and the GPU implementations of the lattice Boltzmann method in complex settings.


Pages: 139-151

Page count: 13

