FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs

Cited by: 32

Authors:

Papakonstantinou, Alexandros [1]

Gururaj, Karthik [2]

Stratton, John A. [1]

Chen, Deming [1]

Cong, Jason [2]

Hwu, Wen-Mei W. [1]

Affiliations:

[1] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA

[2] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA USA

Source:

2009 IEEE 7TH SYMPOSIUM ON APPLICATION SPECIFIC PROCESSORS (SASP 2009) | 2009

DOI:

10.1109/SASP.2009.5226333

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

Pages: 35 / +

Page count: 2

17 items in total

[1] Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs
Papakonstantinou, Alexandros
Gururaj, Karthik
Stratton, John A.
Chen, Deming
Cong, Jason
Hwu, Wen-Mei W.
ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, 2013, 13 (02)
[2] Efficient NAS Parallel Benchmark Kernels with CUDA
de Araujo, Gabriell Alves
Griebler, Dalvan
Danelutto, Marco
Fernandes, Luiz Gustavo
2020 28TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2020), 2020, : 9 - 16
[3] A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs
Stewart, Robert
Michaelson, Greg
Bhowmik, Deepayan
Garcia, Paulo
Wallace, Andy
ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2016 COLLOCATED WORKSHOPS, 2016, 10049 : 174 - 188
[4] FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
Chen, Yao
Gurumani, Swathi T.
Liang, Yun
Li, Guofeng
Guo, Donghui
Rupnow, Kyle
Chen, Deming
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2016, 24 (06) : 2220 - 2233
[5] CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters
Prabhakar, Raghu
Govindarajan, R.
Thazhuthaveetil, Matthew J.
EURO-PAR 2012 PARALLEL PROCESSING, 2012, 7484 : 415 - 426
[6] Enabling the CUDA Unified Memory model in Edge, Cloud and HPC offloaded GPU kernels
Montella, Raffaele
Di Luccio, Diana
De Vita, Ciro Giuseppe
Mellone, Gennaro
Lapegna, Marco
Laccetti, Giuliano
Kosta, Sokol
Giunta, Giulio
2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 834 - 841
[7] MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Stratton, John A.
Stone, Sam S.
Hwu, Wen-mei W.
LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2008, 5335 : 16 - +
[8] Efficient mapping of dimensionality reduction designs onto heterogeneous FPGAs
Bouganis, Christos-S.
Pournara, Iosifina
Cheung, Peter Y. K.
FCCM 2007: 15TH ANNUAL IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, PROCEEDINGS, 2007, : 141 - +
[9] From Capabilities to Regions: Enabling Efficient Compilation of Lexical Effect Handlers
Mueller, Marius
Schuster, Philipp
Starup, Jonathan Lindegaard
Ostermann, Klaus
Braechthauser, Jonathan Immanuel
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2023, 7 (OOPSLA):
[10] TSTC: Enabling Efficient Training via Structured Sparse Tensor Compilation
Huang, Shiyuan
Liu, Fangxin
Li, Tian
Wang, Zongwu
Li, Haomin
Jiang, Li
29TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, ASP-DAC 2024, 2024, : 884 - 889