A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Cited by: 11

Authors:

Chen, Peng [1,2]

Wahib, Mohamed [2]

Takizawa, Shinichiro [2]

Takano, Ryousei [3]

Matsuoka, Satoshi [1,4]

Affiliations:

[1] Tokyo Inst Technol, Tokyo, Japan

[2] Natl Inst Adv Ind Sci & Technol, AIST Tokyo Tech Real World Big Data Computat Open, Tsukuba, Ibaraki, Japan

[3] Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki, Japan

[4] RIKEN Ctr Computat Sci, Kobe, Hyogo, Japan

Source:

PROCEEDINGS OF SC19: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2019

Keywords:

Systolic Array; GPU; CUDA; Convolution; Stencil; Optimization; Arrays; Design

DOI:

10.1145/3295500.3356162

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that performs the computation by shifting partial sums with CUDA warp primitives. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, as well as for convolution kernels, which are increasingly important in deep learning workloads. Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
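The idea of shifting partial sums between the lanes of a warp can be illustrated with a small host-side emulation. The following is a minimal sketch and not the paper's implementation: it is plain Python rather than CUDA, it covers only a 1D stencil, and the names `shfl_up`, `systolic_stencil_1d`, and `reference_stencil_1d` are hypothetical helpers introduced here. `shfl_up` is assumed to mimic the lane semantics of CUDA's `__shfl_up_sync`, where lanes below the shift distance keep their own value.

```python
def shfl_up(values, delta):
    """Emulate a warp shuffle-up: lane i reads the value held by lane i - delta.

    Lanes with index < delta have no source lane and keep their own value
    (assumed to match the behavior of CUDA's __shfl_up_sync).
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_stencil_1d(x, coeffs):
    """1D stencil where each lane keeps one input in a 'register'.

    At step k every lane adds coeffs[k] * (its own input) to the partial sum
    it currently holds, then passes the partial sum to the next lane.
    Lanes with index < len(coeffs) - 1 end up with incomplete sums.
    """
    partial = [0.0] * len(x)
    for k, c in enumerate(coeffs):
        # Each lane accumulates using only its own register value.
        partial = [p + c * xi for p, xi in zip(partial, x)]
        # Shift partial sums one lane up, except after the last coefficient.
        if k < len(coeffs) - 1:
            partial = shfl_up(partial, 1)
    return partial


def reference_stencil_1d(x, coeffs):
    """Direct evaluation of the same stencil for the valid lanes only."""
    K = len(coeffs)
    return [sum(coeffs[k] * x[i - (K - 1 - k)] for k in range(K))
            for i in range(K - 1, len(x))]


if __name__ == "__main__":
    warp_inputs = [float(i) for i in range(32)]  # one value per lane
    weights = [1.0, 2.0, 1.0]
    out = systolic_stencil_1d(warp_inputs, weights)
    # Lanes 0 and 1 lack enough left neighbors, so compare from lane 2 on.
    print(out[2:] == reference_stencil_1d(warp_inputs, weights))
```

In this sketch the input values never move; only the partial sums travel between lanes, which is the property the abstract attributes to the systolic formulation. A real GPU kernel would additionally have to handle the boundary lanes across warps, which this emulation leaves out.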
