GPU acceleration of an iterative scheme for gas-kinetic model equations with memory reduction techniques

Cited by: 25
Authors
Zhu, Lianhua [1 ]
Wang, Peng [1 ]
Chen, Songze [2 ]
Guo, Zhaoli [2 ]
Zhang, Yonghao [1 ]
Affiliations
[1] Univ Strathclyde, Dept Mech & Aerosp Engn, James Weir Fluids Lab, Glasgow G1 1XJ, Lanark, Scotland
[2] Huazhong Univ Sci & Technol, Sch Energy & Power, State Key Lab Coal Combust, Wuhan 430074, Hubei, Peoples R China
Funding
UK Engineering and Physical Sciences Research Council; European Union Horizon 2020; US National Science Foundation;
Keywords
GPU; CUDA; Discrete velocity method; Gas-kinetic equation; High performance computing; DISCRETE VELOCITY GRIDS; STEADY-STATE SOLUTIONS; BOLTZMANN-EQUATION; IMPLICIT SCHEME; POROUS-MEDIA; FLOW; SOLVERS; ALGORITHM; CONTINUUM; SIMULATIONS;
DOI
10.1016/j.cpc.2019.106861
Chinese Library Classification
TP39 [Computer Applications];
Discipline Code
081203; 0835;
Abstract
This paper presents a Graphics Processing Unit (GPU) acceleration of an iteration-based discrete velocity method (DVM) for gas-kinetic model equations. Unlike previous GPU parallelizations of explicit kinetic schemes, this work is based on a fast-converging iterative scheme. The memory reduction techniques previously proposed for the DVM are applied to GPU computing, enabling full three-dimensional (3D) solutions of kinetic model equations on contemporary GPUs, which usually have limited memory capacity, whereas such solutions would otherwise require terabytes of memory. The GPU algorithm is validated against direct simulation Monte Carlo (DSMC) simulations of the 3D lid-driven cavity flow and the supersonic rarefied gas flow past a cube, with up to 0.7 trillion phase-space grid points. Performance profiling on three GPU models shows that the two main kernel functions can utilize 56%–79% of the GPU computing and memory resources. The performance of the GPU algorithm is compared with a typical parallel CPU implementation of the same algorithm using the Message Passing Interface (MPI). For the 3D lid-driven cavity flow, the GPU program achieves speedups of 1.2–2.8 on the K40 and 1.2–2.4 on the K80, respectively, compared with the MPI-parallelized CPU program running on 96 CPU cores. (C) 2019 Elsevier B.V. All rights reserved.
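To illustrate the kind of kernel structure and memory-reduction idea the abstract refers to, the following is a minimal CUDA sketch, not the authors' code: all names (moment_sweep, NV, d_c, d_w, the uniform velocity grid) are hypothetical. Each thread handles one spatial cell, rebuilds the discrete-velocity distribution in registers, and reduces it to macroscopic moments on the fly, so the full phase-space array f(x, xi) never has to reside in GPU memory.

```cuda
// Minimal sketch (not the paper's implementation): one thread per spatial cell
// rebuilds the discrete-velocity distribution in registers and reduces it to
// macroscopic moments on the fly, so only moment fields live in global memory.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int NV = 24;            // number of discrete velocities (hypothetical)

__constant__ double d_c[NV];      // discrete velocity abscissae
__constant__ double d_w[NV];      // quadrature weights

__global__ void moment_sweep(const double* rho_in, const double* u_in,
                             double* rho_out, double* mom_out, int ncell)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ncell) return;

    double rho = rho_in[i];
    double u   = u_in[i];
    double rho_new = 0.0, mom_new = 0.0;

    for (int k = 0; k < NV; ++k) {
        double c = d_c[k];
        // Rebuild a Maxwellian-type distribution from the stored macroscopic state.
        // A real iterative DVM sweep would also add the transported/incoming part here.
        double f = rho * d_w[k] * exp(-(c - u) * (c - u));
        rho_new += f;             // zeroth moment
        mom_new += c * f;         // first moment
    }
    rho_out[i] = rho_new;         // only moments are written back to global memory
    mom_out[i] = mom_new;
}

int main()
{
    const int ncell = 1 << 16;
    const double sqrt_pi = std::sqrt(std::acos(-1.0));
    std::vector<double> h_c(NV), h_w(NV), h_rho(ncell, 1.0), h_u(ncell, 0.1);
    for (int k = 0; k < NV; ++k) {            // crude uniform velocity grid on [-4, 4]
        h_c[k] = -4.0 + 8.0 * k / (NV - 1);
        h_w[k] = 8.0 / (NV - 1) / sqrt_pi;
    }
    cudaMemcpyToSymbol(d_c, h_c.data(), NV * sizeof(double));
    cudaMemcpyToSymbol(d_w, h_w.data(), NV * sizeof(double));

    double *rho_in, *u_in, *rho_out, *mom_out;
    cudaMalloc(&rho_in,  ncell * sizeof(double));
    cudaMalloc(&u_in,    ncell * sizeof(double));
    cudaMalloc(&rho_out, ncell * sizeof(double));
    cudaMalloc(&mom_out, ncell * sizeof(double));
    cudaMemcpy(rho_in, h_rho.data(), ncell * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(u_in,   h_u.data(),   ncell * sizeof(double), cudaMemcpyHostToDevice);

    moment_sweep<<<(ncell + 255) / 256, 256>>>(rho_in, u_in, rho_out, mom_out, ncell);
    cudaDeviceSynchronize();

    double rho0;
    cudaMemcpy(&rho0, rho_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("cell 0 density after reduction: %f\n", rho0);   // ~1.0 for this grid

    cudaFree(rho_in); cudaFree(u_in); cudaFree(rho_out); cudaFree(mom_out);
    return 0;
}
```

Keeping the distribution in registers and writing back only moments is one way the memory footprint can scale with the spatial grid rather than the full phase-space grid; the actual memory reduction strategy and kernels in the paper may differ in detail.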
Pages: 14