FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs

Cited by: 32
Authors
Papakonstantinou, Alexandros [1]
Gururaj, Karthik [2]
Stratton, John A. [1]
Chen, Deming [1]
Cong, Jason [2]
Hwu, Wen-Mei W. [1]
Affiliations
[1] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA
[2] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA USA
Source
2009 IEEE 7TH SYMPOSIUM ON APPLICATION SPECIFIC PROCESSORS (SASP 2009) | 2009
DOI
10.1109/SASP.2009.5226333
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
Pages: 35 / +
Page count: 2
Related papers
17 in total
  • [1] Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs
    Papakonstantinou, Alexandros
    Gururaj, Karthik
    Stratton, John A.
    Chen, Deming
    Cong, Jason
    Hwu, Wen-Mei W.
    ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, 2013, 13 (02)
  • [2] Efficient NAS Parallel Benchmark Kernels with CUDA
    de Araujo, Gabriell Alves
    Griebler, Dalvan
    Danelutto, Marco
    Fernandes, Luiz Gustavo
    2020 28TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2020), 2020, : 9 - 16
  • [3] A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs
    Stewart, Robert
    Michaelson, Greg
    Bhowmik, Deepayan
    Garcia, Paulo
    Wallace, Andy
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2016 COLLOCATED WORKSHOPS, 2016, 10049 : 174 - 188
  • [4] FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
    Chen, Yao
    Gurumani, Swathi T.
    Liang, Yun
    Li, Guofeng
    Guo, Donghui
    Rupnow, Kyle
    Chen, Deming
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2016, 24 (06) : 2220 - 2233
  • [5] CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters
    Prabhakar, Raghu
    Govindarajan, R.
    Thazhuthaveetil, Matthew J.
    EURO-PAR 2012 PARALLEL PROCESSING, 2012, 7484 : 415 - 426
  • [6] Enabling the CUDA Unified Memory model in Edge, Cloud and HPC offloaded GPU kernels
    Montella, Raffaele
    Di Luccio, Diana
    De Vita, Ciro Giuseppe
    Mellone, Gennaro
    Lapegna, Marco
    Laccetti, Giuliano
    Kosta, Sokol
    Giunta, Giulio
    2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 834 - 841
  • [7] MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
    Stratton, John A.
    Stone, Sam S.
    Hwu, Wen-mei W.
    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2008, 5335 : 16 - +
  • [8] Efficient mapping of dimensionality reduction designs onto heterogeneous FPGAs
    Bouganis, Christos-S.
    Pournara, Iosifina
    Cheung, Peter Y. K.
    FCCM 2007: 15TH ANNUAL IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, PROCEEDINGS, 2007, : 141 - +
  • [9] From Capabilities to Regions: Enabling Efficient Compilation of Lexical Effect Handlers
    Mueller, Marius
    Schuster, Philipp
    Starup, Jonathan Lindegaard
    Ostermann, Klaus
    Braechthauser, Jonathan Immanuel
    PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2023, 7 (OOPSLA):
  • [10] TSTC: Enabling Efficient Training via Structured Sparse Tensor Compilation
    Huang, Shiyuan
    Liu, Fangxin
    Li, Tian
    Wang, Zongwu
    Li, Haomin
    Jiang, Li
    29TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, ASP-DAC 2024, 2024, : 884 - 889