pocl: A Performance-Portable OpenCL Implementation

Cited by: 64

Authors:

Jaaskelainen, Pekka [1]

Sanchez de La Lama, Carlos [2]

Schnetter, Erik [3,4,5]

Raiskila, Kalle [6]

Takala, Jarmo [1]

Berg, Heikki [6]

Affiliations:

[1] Tampere Univ Technol, FIN-33101 Tampere, Finland

[2] Knowledge Dev POF, Madrid, Spain

[3] Perimeter Inst Theoret Phys, Waterloo, ON, Canada

[4] Univ Guelph, Dept Phys, Guelph, ON N1G 2W1, Canada

[5] Louisiana State Univ, Ctr Computat & Technol, Baton Rouge, LA 70803 USA

[6] Nokia Res Ctr, Espoo, Finland

Source:

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING | 2015 / Volume 43 / Issue 05

Funding:

National Science Foundation (USA); Natural Sciences and Engineering Research Council of Canada; Academy of Finland;

Keywords:

OpenCL; LLVM; GPGPU; VLIW; SIMD; Parallel programming; Heterogeneous platforms; Performance portability;

DOI:

10.1007/s10766-014-0320-y

Chinese Library Classification:

TP301 [Theory, methods];

Discipline classification code:

081202 ;

Abstract:

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, which enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.

Citation

Pages: 752-785

Page count: 34
