Retargeting and Respecializing GPU Workloads for Performance Portability

Cited by: 0
Authors
Ivanov, Ivan R. [1 ]
Zinenko, Oleksandr [2 ]
Domke, Jens [3 ]
Endo, Toshio [4 ]
Moses, William S. [5 ]
Affiliations
[1] Tokyo Inst Technol, RIKEN R-CCS, Tokyo, Japan
[2] Google DeepMind, Paris, France
[3] RIKEN R-CCS, Kobe, Hyogo, Japan
[4] Tokyo Inst Technol, Tokyo, Japan
[5] Univ Illinois, Google DeepMind, Champaign, IL USA
Source
2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO | 2024
DOI
Not available
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology];
Discipline Code
0812;
Abstract
To come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources, such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by automatically translating from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup over the baseline CUDA implementation on the Rodinia benchmark suite, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
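To make the granularity adjustment concrete, below is a minimal CUDA sketch of thread coarsening, one standard way to change how much work each parallel thread does. The kernel names and the coarsening factor of 4 are illustrative assumptions for this sketch, not the paper's MLIR-based implementation.

// Minimal sketch of thread coarsening in CUDA -- the kind of granularity
// adjustment described in the abstract. Kernel names and the COARSEN
// factor are illustrative assumptions, not the paper's implementation.
#include <cstdio>
#include <cuda_runtime.h>

// Baseline: one output element per thread.
__global__ void scaleBaseline(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}

// Coarsened: each thread processes COARSEN elements at a grid-wide stride,
// so the same problem launches fewer threads and each thread carries more
// work. Strided indexing keeps global memory accesses coalesced.
template <int COARSEN>
__global__ void scaleCoarsened(const float *in, float *out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
#pragma unroll
  for (int k = 0; k < COARSEN; ++k) {
    int i = tid + k * stride;
    if (i < n) out[i] = 2.0f * in[i];
  }
}

int main() {
  const int n = 1 << 20;
  float *in, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  const int block = 256;
  // Baseline launch: one thread per element.
  scaleBaseline<<<(n + block - 1) / block, block>>>(in, out, n);
  // Coarsened launch: a quarter of the threads; an autotuner would pick
  // this factor per target GPU instead of the hard-coded 4 used here.
  scaleCoarsened<4><<<(n / 4 + block - 1) / block, block>>>(in, out, n);
  cudaDeviceSynchronize();
  printf("out[0] = %f\n", out[0]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}

On a GPU with a larger register file, a bigger coarsening factor lets each thread keep more intermediate state in registers; on a smaller GPU the factor shrinks back toward the baseline. Automating this re-sizing per target, and porting the result to AMD GPUs through MLIR, is the kind of transformation the paper's autotuning pipeline performs.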
Pages: 119-132
Page count: 14