Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation

Times Cited: 0
Authors
Matz, Alexander [1 ]
Doerfert, Johannes [2 ]
Froening, Holger [3 ]
Affiliations
[1] IMC Trading BV, Amsterdam, Netherlands
[2] Saarland Univ, Saarbrucken, Germany
[3] Heidelberg Univ, Inst Comp Engn, Heidelberg, Germany
Source
49TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOP PROCEEDINGS, ICPP 2020 | 2020
Keywords
Multi-GPU; Polyhedral Compilation; LLVM; Static Analysis; Code Generation; GPU Communication; Runtime Systems;
DOI
10.1145/3409390.3409403
CLC Number
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application's memory access patterns, and choosing and implementing it is usually a tedious task for programmers. We leverage this observation for a concept that automatically compiles single-GPU code into multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and on a toolchain that compiles CUDA applications fully automatically, without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA's reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.
Pages: 10
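To illustrate the transformation described in the abstract, the sketch below shows a single-GPU element-wise CUDA kernel being manually partitioned along thread-block boundaries across all visible devices, which is the kind of rewrite the paper's toolchain performs automatically. This is a minimal sketch under simplifying assumptions: the kernel `scale`, the driver `scale_multi_gpu`, the block size, and the contiguous per-device buffer slices are hypothetical choices for a regular access pattern, whereas the actual toolchain derives each partition's data regions via polyhedral analysis and tracks stale buffers with a runtime library rather than copying whole slices as done here.

```cuda
#include <cuda_runtime.h>

// Hypothetical single-GPU kernel: element-wise scaling of a vector.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Manual multi-GPU version, partitioned along thread-block boundaries.
// Each device gets a contiguous range of blocks plus the matching slice
// of the buffer; indices are rebased so the kernel body stays unchanged.
void scale_multi_gpu(float *host_x, float a, int n) {
    int num_devs = 0;
    cudaGetDeviceCount(&num_devs);

    const int block = 256;
    const int total_blocks = (n + block - 1) / block;

    for (int d = 0; d < num_devs; ++d) {
        // Thread blocks [first, last) are assigned to device d.
        const int first = (total_blocks * d) / num_devs;
        const int last  = (total_blocks * (d + 1)) / num_devs;
        const int lo = first * block;
        const int hi = (last * block < n) ? last * block : n;
        if (lo >= hi) continue;

        cudaSetDevice(d);
        float *dev_x = nullptr;
        cudaMalloc(&dev_x, (hi - lo) * sizeof(float));
        cudaMemcpy(dev_x, host_x + lo, (hi - lo) * sizeof(float),
                   cudaMemcpyHostToDevice);

        // Launch only this device's share of the original grid.
        scale<<<last - first, block>>>(dev_x, a, hi - lo);

        cudaMemcpy(host_x + lo, dev_x, (hi - lo) * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev_x);
    }
}
```

Because the copies above are blocking, the devices execute one after another; an implementation intended to show the reported speedups would use per-device streams and asynchronous transfers so the partitions actually run concurrently.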