Retargeting and Respecializing GPU Workloads for Performance Portability

Cited by: 1
Authors
Ivanov, Ivan R. [1 ]
Zinenko, Oleksandr [2 ]
Domke, Jens [3 ]
Endo, Toshio [4 ]
Moses, William S. [5 ]
Affiliations
[1] Tokyo Inst Technol, RIKEN R-CCS, Tokyo, Japan
[2] Google DeepMind, Paris, France
[3] RIKEN R-CCS, Kobe, Hyogo, Japan
[4] Tokyo Inst Technol, Tokyo, Japan
[5] Univ Illinois, Google DeepMind, Champaign, IL USA
Source
2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO | 2024
DOI
10.1109/CGO57630.2024.10444828
CLC number
TP3 [Computing technology, computer technology]
Subject classification number
0812
Abstract
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates a need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a specific architecture in mind. Even when such a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources, such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, along with the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs, translating automatically from CUDA while simultaneously adjusting program granularity to fit the size of the target GPU. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementations, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
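To make the granularity adjustment concrete, the following minimal CUDA sketch (our illustration, not code from the paper; the kernel scale, its coarsened variant scale_coarse, and the coarsening factor FACTOR are hypothetical) shows the kind of thread-coarsening rewrite the abstract describes: each thread is widened to process FACTOR elements, shrinking the grid while raising per-thread register demand.

// Hypothetical baseline kernel: one element per thread.
__global__ void scale(float *out, const float *in, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a * in[i];
}

// Coarsened variant: each thread processes FACTOR consecutive elements,
// so the kernel needs FACTOR-times fewer threads. Which FACTOR wins
// depends on the target GPU's register file, fast memory, and occupancy
// limits, which is what compiler-assisted autotuning can search over.
template <int FACTOR>
__global__ void scale_coarse(float *out, const float *in, float a, int n) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * FACTOR;
#pragma unroll
  for (int k = 0; k < FACTOR; ++k) {
    int i = base + k;
    if (i < n) out[i] = a * in[i];
  }
}

// The launch configuration must shrink accordingly, e.g. for FACTOR = 4:
//   int threads = 256;
//   int needed  = (n + 3) / 4;                        // ceil(n / FACTOR)
//   int blocks  = (needed + threads - 1) / threads;   // ceil(needed / threads)
//   scale_coarse<4><<<blocks, threads>>>(out, in, a, n);

A rewrite like this trades parallelism for per-thread resources; the best trade-off differs between otherwise similar NVIDIA and AMD parts, which is why the abstract pairs the transformation with per-target autotuning rather than a fixed factor.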
Pages: 119-132
Page count: 14
Related papers
50 items in total
[31]   Characterizing Convolutional Neural Network Workloads on a Detailed GPU Simulator [J].
Chang, Kwanghee ;
Kim, Minsik ;
Kim, Kyungah ;
Ro, Won Woo .
PROCEEDINGS INTERNATIONAL SOC DESIGN CONFERENCE 2017 (ISOCC 2017), 2017, :84-85
[32]   Evaluating On-Node GPU Interconnects for Deep Learning Workloads [J].
Tallent, Nathan R. ;
Gawande, Nitin A. ;
Siegel, Charles ;
Vishnu, Abhinav ;
Hoisie, Adolfy .
HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING, AND SIMULATION (PMBS 2017), 2018, 10724 :3-21
[33]   Reliability of Large Scale GPU Clusters for Deep Learning Workloads [J].
Qian, Junjie ;
Kim, Taeyoon ;
Jeon, Myeongjae .
WEB CONFERENCE 2021: COMPANION OF THE WORLD WIDE WEB CONFERENCE (WWW 2021), 2021, :179-181
[34]   Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads [J].
Zhou, Qinghua ;
Anthony, Quentin ;
Shafi, Aamir ;
Subramoni, Hari ;
Panda, Dhabaleswar K. .
2022 IEEE 29TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, HIPC, 2022, :22-31
[35]   Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU [J].
Steinberger, Markus ;
Kenzel, Michael ;
Boechat, Pedro ;
Kerbl, Bernhard ;
Dokter, Mark ;
Schmalstieg, Dieter .
ACM TRANSACTIONS ON GRAPHICS, 2014, 33 (06)
[36]   Dynamic GPU Energy Optimization for Machine Learning Training Workloads [J].
Wang, Farui ;
Zhang, Weizhe ;
Lai, Shichao ;
Hao, Meng ;
Wang, Zheng .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (11) :2943-2954
[37]   GPU Memory Reallocation Techniques in Fully Homomorphic Encryption Workloads [J].
Choi, Jake ;
Jung, Sunchul ;
Yeom, Heonyoung .
39TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2024, 2024, :1525-1532
[38]   Analyzing Machine Learning Workloads Using a Detailed GPU Simulator [J].
Lew, Jonathan ;
Shah, Deval A. ;
Pati, Suchita ;
Cattell, Shaylin ;
Zhang, Mengchi ;
Sandhupatla, Amruth ;
Ng, Christopher ;
Goli, Negar ;
Sinclair, Matthew D. ;
Rogers, Timothy G. ;
Aamodt, Tor M. .
2019 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS), 2019, :151-152
[39]   Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads [J].
Georganas, Evangelos ;
Kalamkar, Dhiraj ;
Avancha, Sasikanth ;
Adelman, Menachem ;
Aggarwal, Deepti ;
Anderson, Cristina ;
Breuer, Alexander ;
Bruestle, Jeremy ;
Chaudhary, Narendra ;
Kundu, Abhisek ;
Kutnick, Denise ;
Laub, Frank ;
Md, Vasimuddin ;
Misra, Sanchit ;
Mohanty, Ramanarayan ;
Pabst, Hans ;
Retford, Brian ;
Ziv, Barukh ;
Heinecke, Alexander .
FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS, 2022, 8
[40]   Providing Source Code Level Portability Between CPU and GPU with MapCG [J].
Hong, Chun-Tao ;
Chen, De-Hao ;
Chen, Yu-Bei ;
Chen, Wen-Guang ;
Zheng, Wei-Min ;
Lin, Hai-Bo .
JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2012, 27 (01) :42-56