Enhancing the Programmability and Performance Portability of GPU Tensor Operations

Cited by: 6
Authors
Mazaheri, Arya [1 ]
Schulte, Johannes [1 ]
Moskewicz, Matthew W. [2 ]
Wolf, Felix [1 ]
Jannesari, Ali [3 ]
Affiliations
[1] Tech Univ Darmstadt, Darmstadt, Germany
[2] Deepscale Inc, Mountain View, CA USA
[3] Iowa State Univ, Ames, IA USA
Source
EURO-PAR 2019: PARALLEL PROCESSING | 2019 / Vol. 11725
Keywords
GPU; Deep learning; Performance portability; OpenCL; CUDA
DOI
10.1007/978-3-030-29400-7_16
CLC Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Deep-learning models based on convolutional networks are widely used for many artificial-intelligence tasks, thanks to the increasing adoption of high-throughput GPUs, even in mobile phones. CUDA and OpenCL are the two most widely used programming interfaces for accessing the computing power of GPUs. However, attaining code portability had always been a challenge until the introduction of the Vulkan API; even so, performance portability is still not necessarily provided. In this paper, we investigate the unique characteristics of CUDA, OpenCL, and Vulkan kernels and propose a method for abstracting away their syntactic differences. This abstraction yields a single-source kernel from which we generate code for each GPU programming interface. In addition, we expose auto-tuning parameters to further enhance performance portability. We implemented a selection of convolution operations, covering the core operations needed to deploy three common image-processing neural networks, and tuned them for NVIDIA, AMD, and ARM Mali GPUs. Our experiments show that we can generate deep-learning kernels for new platforms with minimal effort while achieving reasonable performance. In particular, our Vulkan backend delivers performance competitive with vendor deep-learning libraries.
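The abstract's core idea of a single-source kernel specialized for each GPU programming interface can be sketched as simple token substitution over an abstract kernel body. This is a minimal hypothetical illustration of that approach, not the paper's actual code generator; the dialect tables and names below are assumptions, though the CUDA/OpenCL constructs they map (`__global__` vs. `__kernel`, thread-index expressions) are real.

```python
# Minimal sketch of single-source kernel generation: one abstract kernel
# body is specialized for CUDA or OpenCL by substituting the syntactic
# elements that differ between the two programming interfaces.

# Per-backend spellings of the constructs that differ syntactically.
DIALECTS = {
    "cuda": {
        "KERNEL": "__global__",
        "GLOBAL": "",  # CUDA pointers need no address-space qualifier here
        "GID0": "blockIdx.x * blockDim.x + threadIdx.x",
    },
    "opencl": {
        "KERNEL": "__kernel",
        "GLOBAL": "__global",
        "GID0": "get_global_id(0)",
    },
}

# Abstract single-source kernel: backend-specific tokens in {braces}.
VEC_ADD = """{KERNEL} void vec_add({GLOBAL} float* a,
                                   {GLOBAL} float* b,
                                   {GLOBAL} float* c, int n) {{
    int i = {GID0};
    if (i < n) c[i] = a[i] + b[i];
}}"""

def generate(backend: str) -> str:
    """Emit kernel source text for the requested GPU programming interface."""
    return VEC_ADD.format(**DIALECTS[backend])

if __name__ == "__main__":
    print(generate("cuda"))
    print(generate("opencl"))
```

In the same spirit, the auto-tuning parameters the paper mentions (e.g., tile sizes or work-group dimensions) could be additional substitution keys, so each backend and device gets its own tuned specialization of the one abstract kernel.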
Pages: 213-226
Page count: 14
References
20 entries in total
[1] Chen, T. Q., et al. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018, p. 579.
[2] Chetlur, S., et al. arXiv:1410.0759, 2014. DOI: 10.48550/arXiv.1410.0759.
[3] da Silva, H. C.; Pisani, F.; Borin, E. A Comparative Study of SYCL, OpenCL, and OpenMP. 2016 28th IEEE International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2016, pp. 61-66.
[4] Du, P.; Weber, R.; Luszczek, P.; Tomov, S.; Peterson, G.; Dongarra, J. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 2012, 38(8), pp. 391-407.
[5] Huynh, L. N.; Lee, Y.; Balan, R. K. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. MobiSys'17: Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, 2017, pp. 82-95.
[6] Intel. PlaidML, 2019.
[7] Fang, J. 2011 International Conference on Parallel Processing, 2011, p. 216. DOI: 10.1109/ICPP.2011.45.
[8] Karimi, K. arXiv:1005.2581, 2011.
[9] Kim, J.; Dao, T. T.; Jung, J.; Joo, J.; Lee, J. Bridging OpenCL and CUDA: A Comparative Analysis and Translation. Proceedings of SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2015.
[10] Mammeri, N. 2018 IEEE International Symposium on Workload Characterization (IISWC), 2018, p. 25. DOI: 10.1109/IISWC.2018.8573477.