Multi-GPU systems and Unified Virtual Memory for scientific applications: The case of the NAS multi-zone parallel benchmarks

Cited by: 4
Authors
Gonzalez, Marc [1 ]
Morancho, Enric [1 ]
Affiliations
[1] Univ Politecn Catalunya BarcelonaTECH, Dept Comp Architecture, Barcelona, Spain
Keywords
Multi-GPU; Unified Virtual Memory; Single address space; NAS parallel benchmarks;
DOI
10.1016/j.jpdc.2021.08.001
Chinese Library Classification (CLC) number
TP301 [Theory and Methods]
Discipline classification code
081202
Abstract
GPU-based computing systems have become a widely accepted solution for the high-performance-computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution: memory allocation and work distribution. Memory allocation determines the data mapping to GPUs, and therefore conditions all work distribution schemes and communication phases in the application. Unified Virtual Memory simplifies the coding of memory allocations, but its effects on performance depend on how data is used by the devices and how the device driver orchestrates data transfers across the system. In this paper we present a multi-GPU and Unified Virtual Memory (UM) implementation of the NAS Multi-Zone Parallel Benchmarks, which alternate communication and computation phases, offering opportunities to overlap these phases. We analyse the programmability and performance effects of the introduction of the UM support. Our experience shows that the programming effort for introducing UM is similar to that of having a memory allocation per GPU. On an evaluation environment composed of 2 x IBM Power9 8335-GTH and 4 x GPU NVIDIA V100 (Volta), our UM-based parallelization outperforms the manual memory allocation versions by 1.10x to 1.85x. However, these improvements are highly sensitive to the information forwarded to the device driver describing the most convenient location for specific memory regions. We analyse these improvements in terms of the relationship between the computational and communication phases of the applications. (C) 2021 The Author(s). Published by Elsevier Inc.
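The abstract contrasts per-GPU manual allocation with a single Unified Memory allocation whose placement is steered by hints to the driver. As a minimal sketch of that idea (not the paper's actual code), the standard CUDA managed-memory API can allocate one array visible to all GPUs, then use `cudaMemAdvise` and `cudaMemPrefetchAsync` to tell the driver each zone's preferred device; the kernel name `scale` and the even chunking across GPUs are illustrative assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(double *a, size_t n, double f) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] *= f;
}

int main() {
    const size_t n = 1 << 20;
    double *a;
    // One allocation in a single address space, visible to host and all GPUs.
    cudaMallocManaged(&a, n * sizeof(double));
    for (size_t i = 0; i < n; ++i) a[i] = 1.0;

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus == 0) return 1;
    size_t chunk = n / ngpus;
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        double *p = a + (size_t)d * chunk;
        // Hint the driver that this region's preferred home is GPU d,
        // then prefetch it there to avoid first-touch page faults.
        cudaMemAdvise(p, chunk * sizeof(double),
                      cudaMemAdviseSetPreferredLocation, d);
        cudaMemPrefetchAsync(p, chunk * sizeof(double), d);
        scale<<<(unsigned)((chunk + 255) / 256), 256>>>(p, chunk, 2.0);
    }
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
    return 0;
}
```

The paper's reported sensitivity to "information forwarded to the devices' driver" corresponds to the advice/prefetch calls above: with them, each zone migrates once to its preferred GPU; without them, pages fault back and forth between devices during the communication phases.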
Pages: 138-150 (13 pages)