An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications

Cited by: 4

Authors:

Zhou, Keren [1]

Meng, Xiaozhu [1]

Sai, Ryuichi [1]

Grubisic, Dejan [1]

Mellor-Crummey, John [1]

Affiliations:

[1] Rice Univ, Comp Sci Dept, Houston, TX 77054 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2022 / Vol. 33 / Issue 04

Keywords:

Graphics processing units; Optimization; Tools; Measurement; Instruments; Tuning; Registers; High performance computing; performance analysis; parallel programming; parallel architectures;

DOI:

10.1109/TPDS.2021.3094169

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

The US Department of Energy's fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code only provide coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. At the time of this writing, GPU instruction sampling is only available on NVIDIA GPUs. To understand performance losses, GPA uses data flow analysis to approximately attribute measured instruction stalls back to their causes. GPA then analyzes patterns of stalls using information about a program's structure and the GPU architecture to identify optimization strategies that address inefficiencies observed. GPA then employs detailed performance models to estimate the potential speedup that each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimates will accelerate performance across the set of codes by a geometric mean of 1.21x. Applying these optimizations suggested by GPA accelerated these codes by a geometric mean of 1.19x.
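The abstract summarizes both estimated and measured speedups as geometric means. As a minimal sketch of how such a summary figure can be computed, the Python snippet below averages per-code speedup ratios in log space. The function name `geometric_mean` and the sample values are assumptions made for illustration; they are not taken from the paper or from GPA itself.

```python
import math


def geometric_mean(speedups):
    """Return the geometric mean of a sequence of positive speedup ratios."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average the logarithms, then exponentiate; this avoids overflow
    # or underflow that a long running product could cause.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-code speedups (illustrative only, not data from the paper).
example_speedups = [1.0, 2.0, 4.0]
summary = geometric_mean(example_speedups)
```

The geometric mean is the usual choice for ratios such as speedups because a 2x gain on one code and a 0.5x loss on another cancel to 1.0x, which an arithmetic mean would not report.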

Pages: 854-865

Page count: 12

50 records in total

[1] A Tool for Performance Analysis of GPU-Accelerated Applications
Zhou, Keren
Mellor-Crummey, John
PROCEEDINGS OF THE 2019 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO '19), 2019, : 282 - 282
[2] A Tool for Bottleneck Analysis and Performance Prediction for GPU-accelerated Applications
Madougou, Souley
Varbanescu, Ana Lucia
de Laat, Cees
van Nieuwpoort, Rob
2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 641 - 652
[3] A tool for top-down performance analysis of GPU-accelerated applications
Zhou, Keren
Krentel, Mark
Mellor-Crummey, John
PROCEEDINGS OF THE 25TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING (PPOPP '20), 2020, : 415 - 416
[4] Measurement and analysis of GPU-accelerated applications with HPCToolkit
Zhou, Keren
Adhianto, Laksono
Anderson, Jonathon
Cherian, Aaron
Grubisic, Dejan
Krentel, Mark
Liu, Yumeng
Meng, Xiaozhu
Mellor-Crummey, John
PARALLEL COMPUTING, 2021, 108
[5] A hybrid solution method for CFD applications on GPU-accelerated hybrid HPC platforms
Liu, Xiaocheng
Zhong, Ziming
Xu, Kai
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2016, 56 : 759 - 765
[6] Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
Zou, Pengfei
Li, Ang
Barker, Kevin
Ge, Rong
PROCEEDINGS OF THE 49TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2020, 2020,
[7] Fingerprinting Anomalous Computation with RNN for GPU-accelerated HPC Machines
Zou, Pengfei
Li, Ang
Barker, Kevin
Ge, Rong
PROCEEDINGS OF THE 2019 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2019), 2019, : 253 - 256
[8] On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation
Sathre, Paul
Gardner, Mark
Feng, Wu-chun
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION (HPC ASIA 2019), 2019, : 1 - 8
[9] A GPU-Accelerated automated multilevel substructuring method for modal analysis of structures
Wang, Guidong
Wang, Yujie
Chen, Zeyu
Wang, Feiqi
Li, She
Cui, Xiangyang
COMPUTERS & STRUCTURES, 2024, 305
[10] Estimating the WCET of GPU-Accelerated Applications using Hybrid Analysis
Betts, Adam
Donaldson, Alastair
PROCEEDINGS OF THE 2013 25TH EUROMICRO CONFERENCE ON REAL-TIME SYSTEMS (ECRTS 2013), 2013, : 193 - 202
