Retargeting and Respecializing GPU Workloads for Performance Portability

Cited by: 0
Authors
Ivanov, Ivan R. [1 ]
Zinenko, Oleksandr [2 ]
Domke, Jens [3 ]
Endo, Toshio [4 ]
Moses, William S. [5 ]
Affiliations
[1] Tokyo Inst Technol, RIKEN R-CCS, Tokyo, Japan
[2] Google DeepMind, Paris, France
[3] RIKEN R-CCS, Kobe, Hyogo, Japan
[4] Tokyo Inst Technol, Tokyo, Japan
[5] Univ Illinois, Google DeepMind, Champaign, IL USA
Source
2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO | 2024
Keywords
DOI: Not available
CLC Number
TP3 [Computing Technology; Computer Technology];
Subject Classification Code
0812;
Abstract
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that takes into account the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates a need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources, such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by translating automatically from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup over the baseline CUDA implementation on the Rodinia benchmark suite, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
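To make the granularity adjustment concrete, below is a minimal CUDA sketch of thread coarsening, the kind of work-per-thread knob the abstract refers to. Everything in it (the axpy_coarsened kernel, the COARSEN template parameter, the launch arithmetic) is a hypothetical illustration under simple assumptions, not the paper's actual MLIR-based transformation, which discovers and applies such adjustments automatically.

```cuda
// Minimal sketch of thread coarsening for an elementwise AXPY kernel.
// The kernel, the COARSEN parameter, and the launch arithmetic are
// illustrative assumptions only; the paper's transformation performs
// this granularity adjustment automatically at the MLIR level.
#include <cstdio>
#include <cuda_runtime.h>

template <int COARSEN>
__global__ void axpy_coarsened(int n, float a, const float *x, float *y) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // Each thread handles COARSEN elements, strided so that accesses
    // within a warp stay coalesced. Fewer threads are launched overall,
    // changing how many registers and how much occupancy the kernel uses.
#pragma unroll
    for (int i = 0; i < COARSEN; ++i) {
        int idx = tid + i * stride;
        if (idx < n) y[idx] = a * x[idx] + y[idx];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    constexpr int kCoarsen = 4;  // the tunable granularity knob
    const int threads = 256;
    // Launch 4x fewer blocks because each thread now does 4x the work.
    const int blocks = (n + threads * kCoarsen - 1) / (threads * kCoarsen);
    axpy_coarsened<kCoarsen><<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0 = 3*1 + 2
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

In a setup like this, the coarsening factor is exactly the sort of parameter an autotuner guided by the platform-specific compiler could sweep per target GPU, since it simultaneously governs per-thread work, register pressure, and the launch geometry.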
Pages: 119-132
Page count: 14