Revisiting Temporal Blocking Stencil Optimizations

Cited by: 5
Authors
Zhang, Lingqi [1, 3]
Wahib, Mohamed [2]
Chen, Peng [2, 3]
Meng, Jintao [4]
Wang, Xiao [5]
Endo, Toshio [1]
Matsuoka, Satoshi [1, 2]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
[2] RIKEN Ctr Computat Sci, Tokyo, Japan
[3] Natl Inst Adv Ind Sci & Technol, Tokyo, Japan
[4] Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
[5] Oak Ridge Natl Lab, Oak Ridge, TN 37830 USA
Source
PROCEEDINGS OF THE 37TH INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ACM ICS 2023 | 2023
Keywords
Stencil; Temporal Blocking Optimizations; GPU; GPU CODE; COMPUTATIONS; PARALLELISM;
DOI
10.1145/3577193.3593716
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.
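The idea of temporal blocking described in the abstract can be illustrated with a minimal CPU-side sketch. This is not EBISU or any of the compared libraries; it is plain overlapped (ghost-zone) tiling on a 1D 3-point averaging stencil, chosen here as an assumption for illustration. The function names `step`, `naive`, and `temporally_blocked` and the `tile` parameter are hypothetical. Each tile loads a halo as wide as the number of fused time steps, advances all steps locally, and writes back only its interior, so the data is read from the full array once per batch instead of once per step.

```python
def step(u):
    """One 3-point averaging step; the first and last values stay fixed."""
    if len(u) < 3:
        return list(u)
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def naive(u, steps):
    """Reference version: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporally_blocked(u, steps, tile=8):
    """Overlapped tiling: fuse `steps` time steps inside each tile.

    Each tile reads a halo of width `steps` on both sides. Errors from the
    artificially fixed halo edges move inward by one cell per step, so after
    `steps` steps they have not yet reached the tile interior.
    """
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(0, start - steps)   # left edge of tile plus halo
        hi = min(n, end + steps)     # right edge of tile plus halo
        local = u[lo:hi]
        for _ in range(steps):
            local = step(local)
        # Keep only the tile interior; the halo results are discarded.
        out[start:end] = local[start - lo:end - lo]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(37)]
    blocked = temporally_blocked(data, steps=4, tile=8)
    reference = naive(data, steps=4)
    print(max(abs(a - b) for a, b in zip(blocked, reference)))
```

The redundant halo computation grows with the number of fused steps, which is one reason the choice of tile size and blocking depth matters in practice.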
Pages: 251 - 263
Page count: 13
Related papers
14 in total
  • [1] Efficient Stencil Computation with Temporal Blocking by Halide DSL
    Aikawa, Hiroki
    Endo, Toshio
    Yuki, Tomoya
    Hirofuchi, Takahiro
    Ikegami, Tsutomu
    2022 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING, ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM, 2022, : 870 - 877
  • [2] AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
    Matsumura, Kazuaki
    Zohouri, Hamid Reza
    Wahib, Mohamed
    Endo, Toshio
    Matsuoka, Satoshi
    CGO'20: PROCEEDINGS OF THE 18TH ACM/IEEE INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, 2020, : 199 - 211
  • [3] Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
    Zohouri, Hamid Reza
    Podobas, Artur
    Matsuoka, Satoshi
    PROCEEDINGS OF THE 2018 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'18), 2018, : 153 - 162
  • [4] Tiling Optimizations for Stencil Computations Using Rewrite Rules in LIFT
    Stoltzfus, Larisa
    Hagedorn, Bastian
    Steuwer, Michel
    Gorlatch, Sergei
    Dubach, Christophe
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2019, 16 (04)
  • [5] A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations
    Jiayuan Meng
    Kevin Skadron
    International Journal of Parallel Programming, 2011, 39 : 115 - 142
  • [6] A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations
    Meng, Jiayuan
    Skadron, Kevin
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2011, 39 (01) : 115 - 142
  • [7] Revisiting split tiling for stencil computations in polyhedral compilation
    Li, Yingying
    Sun, Huihui
    Pang, Jianmin
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (01) : 440 - 470
  • [8] Revisiting split tiling for stencil computations in polyhedral compilation
    Yingying Li
    Huihui Sun
    Jianmin Pang
    The Journal of Supercomputing, 2022, 78 : 440 - 470
  • [9] A Highly Efficient I/O-based Out-of-Core Stencil Algorithm with Globally Optimized Temporal Blocking
    Midorikawa, Hiroko
    Tan, Hideyuki
    2017 IEEE PACIFIC RIM CONFERENCE ON COMMUNICATIONS, COMPUTERS AND SIGNAL PROCESSING (PACRIM), 2017,
  • [10] Practical applicability of optimizations and performance models to complex stencil-based loop kernels in CFD
    Wichmann, Karl-Robert
    Kronbichler, Martin
    Loehner, Rainald
    Wall, Wolfgang A.
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2019, 33 (04) : 602 - 618