From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming

Cited by: 154
Authors
Du, Peng [1]
Weber, Rick [1]
Luszczek, Piotr [1]
Tomov, Stanimire [1]
Peterson, Gregory [1]
Dongarra, Jack [1, 2]
Affiliations
[1] Univ Tennessee, Knoxville, TN 37996 USA
[2] Univ Manchester, Manchester M13 9PL, Lancs, England
Funding
US National Science Foundation
Keywords
Hardware accelerators; Portability; Auto-tuning; Algorithms; Software
DOI
10.1016/j.parco.2011.10.002
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness. © 2011 Elsevier B.V. All rights reserved.
Pages: 391-407
Page count: 17