An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications

Cited by: 4
Authors
Zhou, Keren [1]
Meng, Xiaozhu [1]
Sai, Ryuichi [1]
Grubisic, Dejan [1]
Mellor-Crummey, John [1]
Affiliations
[1] Rice Univ, Comp Sci Dept, Houston, TX 77054 USA
Keywords
Graphics processing units; Optimization; Tools; Measurement; Instruments; Tuning; Registers; High performance computing; performance analysis; parallel programming; parallel architectures
DOI
10.1109/TPDS.2021.3094169
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The US Department of Energy's fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code only provide coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. At the time of this writing, GPU instruction sampling is only available on NVIDIA GPUs. To understand performance losses, GPA uses data flow analysis to approximately attribute measured instruction stalls back to their causes. GPA then analyzes patterns of stalls using information about a program's structure and the GPU architecture to identify optimization strategies that address inefficiencies observed. GPA then employs detailed performance models to estimate the potential speedup that each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimates will accelerate performance across the set of codes by a geometric mean of 1.21x. Applying these optimizations suggested by GPA accelerated these codes by a geometric mean of 1.19x.
Pages: 854-865
Page count: 12
Related papers
  • [1] A Tool for Performance Analysis of GPU-Accelerated Applications
    Zhou, Keren
    Mellor-Crummey, John
    PROCEEDINGS OF THE 2019 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO '19), 2019, : 282 - 282
  • [2] A Tool for Bottleneck Analysis and Performance Prediction for GPU-accelerated Applications
    Madougou, Souley
    Varbanescu, Ana Lucia
    de Laat, Cees
    van Nieuwpoort, Rob
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 641 - 652
  • [3] A tool for top-down performance analysis of GPU-accelerated applications
    Zhou, Keren
    Krentel, Mark
    Mellor-Crummey, John
    PROCEEDINGS OF THE 25TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING (PPOPP '20), 2020, : 415 - 416
  • [4] Measurement and analysis of GPU-accelerated applications with HPCToolkit
    Zhou, Keren
    Adhianto, Laksono
    Anderson, Jonathon
    Cherian, Aaron
    Grubisic, Dejan
    Krentel, Mark
    Liu, Yumeng
    Meng, Xiaozhu
    Mellor-Crummey, John
    PARALLEL COMPUTING, 2021, 108
  • [5] A hybrid solution method for CFD applications on GPU-accelerated hybrid HPC platforms
    Liu, Xiaocheng
    Zhong, Ziming
    Xu, Kai
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2016, 56 : 759 - 765
  • [6] Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
    Zou, Pengfei
    Li, Ang
    Barker, Kevin
    Ge, Rong
    PROCEEDINGS OF THE 49TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2020, 2020
  • [7] Fingerprinting Anomalous Computation with RNN for GPU-accelerated HPC Machines
    Zou, Pengfei
    Li, Ang
    Barker, Kevin
    Ge, Rong
    PROCEEDINGS OF THE 2019 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2019), 2019, : 253 - 256
  • [8] On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation
    Sathre, Paul
    Gardner, Mark
    Feng, Wu-chun
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION (HPC ASIA 2019), 2019, : 1 - 8
  • [9] A GPU-Accelerated automated multilevel substructuring method for modal analysis of structures
    Wang, Guidong
    Wang, Yujie
    Chen, Zeyu
    Wang, Feiqi
    Li, She
    Cui, Xiangyang
    COMPUTERS & STRUCTURES, 2024, 305
  • [10] Estimating the WCET of GPU-Accelerated Applications using Hybrid Analysis
    Betts, Adam
    Donaldson, Alastair
    PROCEEDINGS OF THE 2013 25TH EUROMICRO CONFERENCE ON REAL-TIME SYSTEMS (ECRTS 2013), 2013, : 193 - 202