A compression-based memory-efficient optimization for out-of-core GPU stencil computation

Cited by: 3

Authors:

Shen, Jingcheng [1]

Long, Linbo [1]

Deng, Xin [1]

Okita, Masao [2]

Ino, Fumihiko [2]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, 2 Chongwen Rd, Chongqing 400065, Peoples R China

[2] Osaka Univ, Grad Sch Informat Sci & Technol, 1-5 Yamadaoka, Suita, Osaka 5650871, Japan

Source:

Journal of Supercomputing | 2023, Vol. 79, No. 10

Funding:

National Natural Science Foundation of China; Japan Society for the Promotion of Science

Keywords:

On-the-fly compression; Stencil computation; Out-of-core; GPU; Lossy compression; Algorithm

DOI:

10.1007/s11227-023-05103-8

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Code for out-of-core stencil computation manages data that exceeds the memory capacity of a GPU. However, such code requires frequent data transfers between the CPU and GPU, which often impede overall performance. In this work, we propose a compression-based, memory-efficient method to accelerate out-of-core stencil codes. First, an on-the-fly compression technique is integrated into the out-of-core computation to reduce CPU-GPU data transfers. Second, a single-working-buffer strategy is employed to reduce GPU memory usage, so that more data can be kept on the GPU for reuse and the number of temporal blocking steps can be increased. Experimental results show that the proposed method reduces GPU memory usage by 21%, which creates room to double the number of temporal blocking steps compared with codes without compression. On an NVIDIA Tesla V100 GPU, the proposed method helps high-order, data-transfer-bound stencil codes achieve speedups of up to 2.09x in single precision and up to 1.92x in double precision over codes without compression.
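The interplay between out-of-core chunking, temporal blocking, and compressed transfers described in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the 3-point averaging stencil, the function names (`step`, `pack`, `unpack`, `out_of_core`), and the use of lossless `zlib` as a stand-in for the GPU compressor are all assumptions made for illustration, and the single-working-buffer strategy is not modeled. Each chunk is "transferred" once per round together with a halo of `RADIUS * k` cells, advanced `k` time steps locally, and only its interior is written back, so a larger `k` means fewer transfers.

```python
import zlib
from array import array

RADIUS = 1  # assumed stencil radius (3-point average), for illustration only


def step(u):
    """One stencil sweep over a list; the two end cells are held fixed."""
    out = u[:]
    for i in range(RADIUS, len(u) - RADIUS):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def reference(u, steps):
    """In-core baseline: sweep the whole domain `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u


def pack(values):
    """Stand-in for compressing a chunk before a CPU-to-GPU transfer."""
    return zlib.compress(array("d", values).tobytes())


def unpack(blob):
    """Stand-in for decompressing the chunk on the device side."""
    a = array("d")
    a.frombytes(zlib.decompress(blob))
    return a.tolist()


def out_of_core(u, steps, chunk, k):
    """Advance `u` by `steps` sweeps, chunk by chunk, k sweeps per transfer.

    Returns (result, number_of_transfers, compressed_bytes_moved).
    """
    n = len(u)
    transfers = 0
    moved = 0
    for t0 in range(0, steps, k):
        kk = min(k, steps - t0)      # sweeps performed in this round
        halo = RADIUS * kk           # extra cells needed on each side
        new = u[:]
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            lo, hi = max(0, start - halo), min(n, end + halo)
            blob = pack(u[lo:hi])    # "send" chunk plus halo, compressed
            transfers += 1
            moved += len(blob)
            work = unpack(blob)
            for _ in range(kk):      # temporal blocking: reuse data kk times
                work = step(work)
            # Only the chunk interior is valid; halo cells are discarded.
            new[start:end] = work[start - lo:end - lo]
        u = new
    return u, transfers, moved


if __name__ == "__main__":
    u0 = [float((i * 7) % 11) for i in range(64)]
    for k in (1, 4):
        result, transfers, moved = out_of_core(u0, 8, 16, k)
        print(k, transfers, moved, result == reference(u0, 8))
```

Because the stand-in compressor is lossless and each cell is updated with the same arithmetic as in the baseline, the chunked result matches the in-core result exactly. With a lossy compressor, as the keywords suggest the paper considers, the comparison would instead need an error tolerance.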

Pages: 11055-11077

Page count: 23

