Effects of mesh loop modes on performance of unstructured finite volume GPU simulations

Cited: 4
Authors
Weng, Yue [1 ]
Zhang, Xi [1 ]
Guo, Xiaohu [2 ]
Zhang, Xianwei [1 ]
Lu, Yutong [1 ]
Liu, Yang [3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Hartree Ctr, STFC Daresbury Lab, Warrington, Cheshire, England
[3] China Aerodynam Res & Dev Ctr, Mianyang, Sichuan, Peoples R China
Keywords
GPU; CFD; Finite volume; Unstructured mesh; Mesh loop modes; Data locality; Data dependence
DOI
10.1186/s42774-021-00073-y
Chinese Library Classification
TH [Machinery and Instrument Industry];
Discipline Classification Code
0802 ;
Abstract
In the unstructured finite volume method, loops over different mesh components such as cells, faces, and nodes are widely used to traverse data. The loop mode determines whether data are accessed directly or indirectly, which strongly affects data locality; it also determines whether many threads write to the same data, which introduces data dependence. Both data locality and data dependence play an important part in the performance of GPU simulations. To optimize a GPU-accelerated unstructured finite volume Computational Fluid Dynamics (CFD) program, the performance of its hot spots under different loops over cells, faces, and nodes is evaluated on Nvidia Tesla V100 and K80 GPUs. Numerical tests at different mesh scales show that the mesh loop modes affect data locality and data dependence differently. Specifically, the face loop gives the best data locality whenever face data are accessed in a kernel. The cell loop incurs the smallest overhead from non-coalesced data access when a kernel uses both cell and node data but no face data, and it performs best when a kernel involves only indirect access to cell data. Atomic operations reduce kernel performance substantially on the K80, an effect that is much less pronounced on the V100. By choosing the suitable mesh loop mode for every kernel, the overall performance of the GPU simulations is increased by 15%-20%. Finally, the program on a single V100 GPU achieves a maximum speedup of 21.7 and an average speedup of 14.1 over 28 MPI tasks on two Intel Xeon Gold 6132 CPUs.
Pages: 23
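To make the loop-mode trade-off described in the abstract concrete, here is a minimal CUDA sketch, not the paper's code: the kernel names and data layout (owner, neighbour, faceFlux, cellFaces, faceSign, residual) are hypothetical. It contrasts a face loop, which reads face data in a coalesced way but must scatter fluxes to the two adjacent cells with atomic operations, with a cell loop, which gathers the same fluxes through indirect, generally non-coalesced reads but needs no atomics.

```cuda
// Hypothetical flux-accumulation kernels for an unstructured finite volume solver.
// Names (owner, neighbour, faceFlux, cellFaces, faceSign, residual) are illustrative only.

// Face loop: one thread per face. Face data are read contiguously (good locality),
// but two faces of the same cell may be handled by different threads, so the
// per-cell accumulation is a data dependence resolved here with atomicAdd.
// Note: native atomicAdd on double requires compute capability >= 6.0 (e.g. V100);
// on a K80 it would have to be emulated with atomicCAS.
__global__ void faceLoopScatter(int nFaces,
                                const int* __restrict__ owner,      // cell on one side of each face
                                const int* __restrict__ neighbour,  // cell on the other side
                                const double* __restrict__ faceFlux,
                                double* residual)                   // per-cell accumulation
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFaces) return;
    double flux = faceFlux[f];                  // coalesced read of face data
    atomicAdd(&residual[owner[f]],      flux);  // indirect, possibly conflicting writes
    atomicAdd(&residual[neighbour[f]], -flux);
}

// Cell loop: one thread per cell. Each thread gathers the fluxes of its own faces,
// so no atomics are needed, at the cost of indirect, non-coalesced reads of face data.
__global__ void cellLoopGather(int nCells, int maxFacesPerCell,
                               const int* __restrict__ cellFaces,   // [nCells * maxFacesPerCell], -1 marks an unused slot
                               const double* __restrict__ faceSign, // +1 if the cell owns the face, -1 otherwise
                               const double* __restrict__ faceFlux,
                               double* residual)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCells) return;
    double acc = 0.0;
    for (int k = 0; k < maxFacesPerCell; ++k) {
        int f = cellFaces[c * maxFacesPerCell + k];
        if (f >= 0) acc += faceSign[c * maxFacesPerCell + k] * faceFlux[f];
    }
    residual[c] = acc;                          // one uncontended write per cell
}
```

Under these assumptions, the trade-off reported in the abstract maps onto the atomic writes of the face loop (costly on the K80, cheap on the V100) versus the non-coalesced gathers of the cell loop.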