LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU

Times cited: 4
Authors
Levchenko, Vadim [1]
Zakirov, Andrey [1]
Perepelkina, Anastasia [1,2]
Affiliations
[1] Keldysh Inst Appl Math, Moscow, Russia
[2] Kintech Lab Ltd, Moscow, Russia
Source
PARALLEL COMPUTATIONAL TECHNOLOGIES, PCT 2019 | 2019 / Vol. 1063
Funding
Russian Science Foundation;
Keywords
LRnLA; LBM; Temporal blocking; Time skewing; GPU; Vectorization; ALGORITHM;
DOI
10.1007/978-3-030-28163-2_10
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present an implementation of the Lattice Boltzmann Method (LBM) with Locally Recursive non-Locally Asynchronous (LRnLA) algorithms on GPU and CPU. The algorithm is based on recursive subdivision of the dD1T space-time simulation domain and loosens the memory-bound limit for numerical schemes with local dependencies. We show that the LRnLA algorithm makes it possible to overcome the main memory bandwidth limitations in both CPU and GPU implementations. For CPU, we find the data layout that provides the alignment needed for full use of AVX2/AVX512 vectorization. For GPU, we devise a procedure for pairwise CUDA-block synchronization and apply it to the implementation of the LRnLA algorithm, which previously worked only on CPU. The performance on GPU is higher, as is usual in modern implementations. However, the performance gap in our implementation is smaller, thanks to a more efficient CPU version. Through a detailed comparison, we show possible future applications for both the CPU and the GPU implementations of the lattice Boltzmann method in complex settings.
Pages: 139-151
Page count: 13