pocl: A Performance-Portable OpenCL Implementation

Cited by: 64
Authors
Jaaskelainen, Pekka [1]
Sanchez de La Lama, Carlos [2]
Schnetter, Erik [3, 4, 5]
Raiskila, Kalle [6]
Takala, Jarmo [1]
Berg, Heikki [6]
Affiliations
[1] Tampere Univ Technol, FIN-33101 Tampere, Finland
[2] Knowledge Dev POF, Madrid, Spain
[3] Perimeter Inst Theoret Phys, Waterloo, ON, Canada
[4] Univ Guelph, Dept Phys, Guelph, ON N1G 2W1, Canada
[5] Louisiana State Univ, Ctr Computat & Technol, Baton Rouge, LA 70803 USA
[6] Nokia Res Ctr, Espoo, Finland
Funding
US National Science Foundation; Natural Sciences and Engineering Research Council of Canada; Academy of Finland
Keywords
OpenCL; LLVM; GPGPU; VLIW; SIMD; Parallel programming; Heterogeneous platforms; Performance portability;
DOI
10.1007/s10766-014-0320-y
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by utilizing the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources.
The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
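The idea of parallel region formation mentioned in the abstract can be illustrated with a small Python model. This is only a sketch under assumptions, not pocl's actual code: it assumes that a parallel region is the code between two work-group barriers, and that each region can be executed as a loop over the work-items of a work-group. The function names `run_work_group` and `run_ndrange` and the "reverse within a work-group" kernel are hypothetical and chosen purely for illustration.

```python
def run_work_group(group_id, local_size, src, dst):
    """Serially execute one work-group of a hypothetical 'reverse' kernel.

    The modeled kernel copies its input element to local memory, waits at a
    barrier, then writes out the element stored by the mirrored work-item.
    """
    local_mem = [None] * local_size
    base = group_id * local_size

    # Parallel region 1 (code before the barrier): one loop iteration per
    # work-item. Iterations do not depend on each other, so a target-specific
    # step could map them to SIMD lanes, multiple issue slots, or threads.
    for lid in range(local_size):
        local_mem[lid] = src[base + lid]

    # The barrier itself needs no code here: region 1 has completed for all
    # work-items before region 2 starts.

    # Parallel region 2 (code after the barrier), again one iteration per
    # work-item with no cross-iteration dependences.
    for lid in range(local_size):
        dst[base + lid] = local_mem[local_size - 1 - lid]


def run_ndrange(global_size, local_size, src):
    """Run all work-groups of a 1-D index space one after another."""
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    dst = [None] * global_size
    for group_id in range(global_size // local_size):
        run_work_group(group_id, local_size, src, dst)
    return dst
```

Calling `run_ndrange(8, 4, list(range(8)))` reverses each group of four elements independently. In this model, the information that the loop iterations are independent is what a compiler would want to preserve for later passes; how pocl records it in LLVM IR metadata is described in the paper itself and is not modeled here.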
Pages: 752 - 785
Number of pages: 34
Related papers
50 in total
  • [1] pocl: A Performance-Portable OpenCL Implementation
    Pekka Jääskeläinen
    Carlos Sánchez de La Lama
    Erik Schnetter
    Kalle Raiskila
    Jarmo Takala
    Heikki Berg
    International Journal of Parallel Programming, 2015, 43 : 752 - 785
  • [2] Developing Performance-Portable Molecular Dynamics Kernels in OpenCL
    Pennycook, S. J.
    Jarvis, S. A.
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012 : 386 - 395
  • [3] PPOpenCL: A Performance-Portable OpenCL Compiler with Host and Kernel Thread Code Fusion
    Liu, Ying
    Huang, Lei
    Wu, Mingchuan
    Cui, Huimin
    Lv, Fang
    Feng, Xiaobing
    Xue, Jingling
    PROCEEDINGS OF THE 28TH INTERNATIONAL CONFERENCE ON COMPILER CONSTRUCTION (CC '19), 2019 : 2 - 16
  • [4] Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks
    Tsai, Yaohung M.
    Luszczek, Piotr
    Kurzak, Jakub
    Dongarra, Jack
    PROCEEDINGS OF 2016 2ND WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC), 2016 : 9 - 18
  • [5] From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
    Du, Peng
    Weber, Rick
    Luszczek, Piotr
    Tomov, Stanimire
    Peterson, Gregory
    Dongarra, Jack
    PARALLEL COMPUTING, 2012, 38 (08) : 391 - 407
  • [6] Vectorized and performance-portable quicksort
    Wassenberg, Jan
    Blacher, Mark
    Giesen, Joachim
    Sanders, Peter
    SOFTWARE-PRACTICE & EXPERIENCE, 2022, 52 (12) : 2684 - 2699
  • [7] kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos
    Takahashi, Keichi
    Watanakeesuntorn, Wassapon
    Ichikawa, Kohei
    Park, Joseph
    Takano, Ryousei
    Haga, Jason
    Sugihara, George
    Pao, Gerald M.
    PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 2021, PEARC 2021, 2021
  • [8] Writing a performance-portable matrix multiplication
    Fabeiro, Jorge F.
    Andrade, Diego
    Fraguela, Basilio B.
    PARALLEL COMPUTING, 2016, 52 : 65 - 77
  • [9] Performance-portable Binary Neutron Star Mergers with AthenaK
    Fields, Jacob
    Zhu, Hengrui
    Radice, David
    Stone, James M.
    Cook, William
    Bernuzzi, Sebastiano
    Daszuta, Boris
    ASTROPHYSICAL JOURNAL SUPPLEMENT SERIES, 2025, 276 (02)
  • [10] Performance-Portable Graph Coarsening for Efficient Multilevel Graph Analysis
    Gilbert, Michael S.
    Acer, Seher
    Boman, Erik G.
    Madduri, Kamesh
    Rajamanickam, Sivasankaran
    2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021 : 213 - 222