Enhancing the Programmability and Performance Portability of GPU Tensor Operations

Cited by: 6
Authors
Mazaheri, Arya [1 ]
Schulte, Johannes [1 ]
Moskewicz, Matthew W. [2 ]
Wolf, Felix [1 ]
Jannesari, Ali [3 ]
Affiliations
[1] Tech Univ Darmstadt, Darmstadt, Germany
[2] Deepscale Inc, Mountain View, CA USA
[3] Iowa State Univ, Ames, IA USA
Source
EURO-PAR 2019: PARALLEL PROCESSING | 2019 / Vol. 11725
Keywords
GPU; Deep learning; Performance portability; OpenCL; CUDA
DOI
10.1007/978-3-030-29400-7_16
CLC Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Deep-learning models based on convolutional networks are widely used for many artificial-intelligence tasks, thanks to the increasing adoption of high-throughput GPUs, even in mobile phones. CUDA and OpenCL are the two most widely used programming interfaces for accessing the computing power of GPUs. However, attaining code portability had always been a challenge until the introduction of the Vulkan API; even so, performance portability is still not necessarily provided. In this paper, we investigate the unique characteristics of CUDA, OpenCL, and Vulkan kernels and propose a method for abstracting away their syntactic differences. This abstraction yields a single-source kernel from which we generate code for each GPU programming interface. In addition, we expose auto-tuning parameters to further enhance performance portability. We implemented a selection of convolution operations, covering the core operations needed to deploy three common image-processing neural networks, and tuned them for NVIDIA, AMD, and ARM Mali GPUs. Our experiments show that we can generate deep-learning kernels for new platforms with minimal effort while achieving reasonable performance. In particular, our Vulkan backend delivers performance competitive with vendor deep-learning libraries.
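The abstract's core idea of a single-source kernel specialized for each GPU programming interface can be sketched as simple token substitution over an abstract kernel body. This is a minimal hypothetical illustration of that approach, not the paper's actual code generator; the dialect tables and names below are assumptions, though the CUDA/OpenCL constructs they map (`__global__` vs. `__kernel`, thread-index expressions) are real.

```python
# Minimal sketch of single-source kernel generation: one abstract kernel
# body is specialized for CUDA or OpenCL by substituting the syntactic
# elements that differ between the two programming interfaces.

# Per-backend spellings of the constructs that differ syntactically.
DIALECTS = {
    "cuda": {
        "KERNEL": "__global__",
        "GLOBAL": "",  # CUDA pointers need no address-space qualifier here
        "GID0": "blockIdx.x * blockDim.x + threadIdx.x",
    },
    "opencl": {
        "KERNEL": "__kernel",
        "GLOBAL": "__global",
        "GID0": "get_global_id(0)",
    },
}

# Abstract single-source kernel: backend-specific tokens in {braces}.
VEC_ADD = """{KERNEL} void vec_add({GLOBAL} float* a,
                                   {GLOBAL} float* b,
                                   {GLOBAL} float* c, int n) {{
    int i = {GID0};
    if (i < n) c[i] = a[i] + b[i];
}}"""

def generate(backend: str) -> str:
    """Emit kernel source text for the requested GPU programming interface."""
    return VEC_ADD.format(**DIALECTS[backend])

if __name__ == "__main__":
    print(generate("cuda"))
    print(generate("opencl"))
```

In the same spirit, the auto-tuning parameters the paper mentions (e.g., tile sizes or work-group dimensions) could be additional substitution keys, so each backend and device gets its own tuned specialization of the one abstract kernel.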
Pages: 213-226
Page count: 14
References
20 entries in total
[1] Chen, T. Q., et al. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018, p. 579.
[2] Chetlur, S., et al. arXiv:1410.0759, 2014. DOI: 10.48550/arXiv.1410.0759.
[3] da Silva, H. C.; Pisani, F.; Borin, E. A Comparative Study of SYCL, OpenCL, and OpenMP. 2016 28th IEEE International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2016, pp. 61-66.
[4] Du, P.; Weber, R.; Luszczek, P.; Tomov, S.; Peterson, G.; Dongarra, J. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 2012, 38(8), pp. 391-407.
[5] Huynh, L. N.; Lee, Y.; Balan, R. K. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. MobiSys'17: Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, 2017, pp. 82-95.
[6] Intel. PlaidML, 2019.
[7] Fang, J. 2011 International Conference on Parallel Processing, 2011, p. 216. DOI: 10.1109/ICPP.2011.45.
[8] Karimi, K. arXiv:1005.2581, 2011.
[9] Kim, J.; Dao, T. T.; Jung, J.; Joo, J.; Lee, J. Bridging OpenCL and CUDA: A Comparative Analysis and Translation. Proceedings of SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2015.
[10] Mammeri, N. 2018 IEEE International Symposium on Workload Characterization (IISWC), 2018, p. 25. DOI: 10.1109/IISWC.2018.8573477.