A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Cited by: 11
Authors
Chen, Peng [1 ,2 ]
Wahib, Mohamed [2 ]
Takizawa, Shinichiro [2 ]
Takano, Ryousei [3 ]
Matsuoka, Satoshi [1 ,4 ]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
[2] Natl Inst Adv Ind Sci & Technol, AIST Tokyo Tech Real World Big Data Computat Open, Tsukuba, Ibaraki, Japan
[3] Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki, Japan
[4] RIKEN Ctr Computat Sci, Kobe, Hyogo, Japan
Source
PROCEEDINGS OF SC19: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2019
Keywords
Systolic Array; GPU; CUDA; Convolution; Stencil; OPTIMIZATION; ARRAYS; DESIGN;
DOI
10.1145/3295500.3356162
CLC Number
TP301 [Theory, Methods];
Discipline Code
081202 ;
Abstract
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
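The abstract's core idea, shifting partial sums between warp lanes via CUDA warp primitives, can be illustrated with a host-side sketch. The snippet below is not the paper's implementation: it is a minimal Python simulation in which `shfl_up` models CUDA's `__shfl_up_sync`, each "lane" of a 32-wide warp holds one input element in a register, and partial sums march lane-to-lane while each lane adds its element scaled by the next filter coefficient, computing a 1D stencil systolically.

```python
# Illustrative host-side model of systolic partial-sum shifting (NOT the
# paper's CUDA code). Names `shfl_up` and `systolic_1d_stencil` are
# hypothetical; `shfl_up` mimics the effect of CUDA's __shfl_up_sync.

WARP = 32  # warp width on Nvidia GPUs


def shfl_up(vals, delta, fill=0.0):
    """Lane i receives the value held by lane i - delta (models __shfl_up_sync).

    Lanes with no source lane (i < delta) receive `fill`; in real CUDA they
    would keep their own value, but zero-filling keeps this sketch simple.
    """
    return [fill] * delta + vals[:-delta]


def systolic_1d_stencil(x, w):
    """Apply a len(w)-tap 1D stencil to one warp's worth of data.

    Each step shifts the partial sums up by one lane, then every lane adds
    its own input element weighted by the next coefficient. After all steps,
    lane i holds sum_k w[k] * x[i - (len(w) - 1) + k], i.e. the stencil
    output whose window ends at lane i; lanes i < len(w) - 1 hold partial
    (boundary) results.
    """
    assert len(x) == WARP
    partial = [0.0] * WARP
    for coeff in w:
        partial = shfl_up(partial, 1)                     # systolic shift
        partial = [p + coeff * xi for p, xi in zip(partial, x)]
    return partial
```

For a 3-tap filter `w`, lane `i >= 2` ends up holding `w[0]*x[i-2] + w[1]*x[i-1] + w[2]*x[i]`, so a full sliding-window result is produced with no shared-memory traffic, only register-to-register lane exchanges, which is the property the paper exploits for memory-bound stencil and convolution kernels.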
Pages: 81