A compression-based memory-efficient optimization for out-of-core GPU stencil computation

Cited by: 3

Authors:

Shen, Jingcheng [1]

Long, Linbo [1]

Deng, Xin [1]

Okita, Masao [2]

Ino, Fumihiko [2]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, 2 Chongwen Rd, Chongqing 400065, Peoples R China

[2] Osaka Univ, Grad Sch Informat Sci & Technol, 1-5 Yamadaoka, Suita, Osaka 5650871, Japan

Source:

JOURNAL OF SUPERCOMPUTING | 2023 / Vol. 79 / Issue 10

Funding:

National Natural Science Foundation of China; Japan Society for the Promotion of Science;

Keywords:

On-the-fly compression; Stencil computation; Out-of-core; GPU; LOSSY COMPRESSION; ALGORITHM;

DOI:

10.1007/s11227-023-05103-8

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

A code for out-of-core stencil computation manages data that exceeds the memory capacity of a GPU. However, such a code necessitates frequent data transfers between the CPU and GPU, which often impede overall performance. In this work, we propose a compression-based, memory-efficient method to accelerate out-of-core stencil codes. First, an on-the-fly compression technique is integrated into the out-of-core computation to reduce CPU-GPU data transfers. Second, a single-working-buffer strategy is employed to reduce GPU memory usage, enabling more data to be stored on the GPU for reuse and thereby allowing more temporal blocking steps. Experimental results demonstrate that the proposed method reduces GPU memory usage by 21%, creating space to double the number of temporal blocking steps compared to the codes without compression. The proposed method is shown to help high-order, data-transfer-bound stencil codes achieve speedups of up to 2.09x for the single-precision floating-point format and up to 1.92x for the double-precision floating-point format on an NVIDIA Tesla V100 GPU in comparison with the codes without compression.
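
The interplay of chunked processing, temporal blocking, and compressed transfers described in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the 1D 3-point stencil, the function names (`step`, `pack`, `unpack`, `in_core`, `out_of_core`), and the use of lossless `zlib` as a stand-in for a GPU compressor are all assumptions made for illustration, and the single-working-buffer strategy is not modeled. Each chunk is "transferred" in compressed form together with a halo of `k * RADIUS` cells, advanced `k` time steps, and only its interior is written back.

```python
import struct
import zlib

RADIUS = 1  # assumed 3-point stencil; the paper targets high-order stencils


def step(u):
    """One Jacobi step of a 3-point average; the two endpoints stay fixed."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def pack(values):
    """Stand-in for on-the-fly compression before a CPU-GPU copy."""
    return zlib.compress(struct.pack(f"{len(values)}d", *values))


def unpack(blob):
    """Stand-in for decompression after the copy."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"{len(raw) // 8}d", raw))


def in_core(u, steps):
    """Reference: the whole domain fits in memory."""
    for _ in range(steps):
        u = step(u)
    return u


def out_of_core(u, steps, chunk, k):
    """Process the domain chunk by chunk, k time steps per transfer."""
    assert steps % k == 0
    n = len(u)
    halo = k * RADIUS  # cells invalidated at each chunk edge after k steps
    for _ in range(steps // k):
        new = u[:]
        for start in range(0, n, chunk):
            stop = min(start + chunk, n)
            lo, hi = max(0, start - halo), min(n, stop + halo)
            block = unpack(pack(u[lo:hi]))      # compressed transfer in
            for _ in range(k):                  # temporal blocking
                block = step(block)
            interior = block[start - lo:stop - lo]
            new[start:stop] = unpack(pack(interior))  # compressed transfer out
        u = new
    return u


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(32)]
    same = out_of_core(data, steps=8, chunk=8, k=4) == in_core(data, 8)
    print("matches in-core reference:", same)
```

In this sketch a larger `k` means fewer sweeps over the domain, and therefore fewer transfers, at the cost of wider halos and more memory per chunk. That is the trade-off the abstract refers to when it says freed GPU memory allows more temporal blocking steps.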


Pages: 11055-11077

Page count: 23
