Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2
Authors
Belayneh, Leul [1]
Ye, Haojie [1]
Chen, Kuan-Yu [1]
Blaauw, David [1]
Mudge, Trevor [1]
Dreslinski, Ronald [1]
Talati, Nishil [1]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Source
PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022 | 2022
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE;
DOI
10.1145/3559009.3569649
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. But with the end of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly-coupled multi-GPU systems. However, the Non-Uniform Memory Access (NUMA) bottleneck prevents multi-GPU systems from efficiently utilizing their compute resources. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as either cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that is not exploited well by the conventional cache subsystem of the GPU. For cache-insensitive workloads, DualOpt proposes a fine-granularity transfer of remote data instead of the conventional cache line transfer. These remote data are then coalesced to efficiently utilize inter-GPU bandwidth. For cache-friendly workloads, DualOpt adds a remote-only cache that can exploit locality in remote accesses. Finally, a decision engine automatically identifies the class of workload and delivers the corresponding optimization, which improves overall performance by 2.5x on a 4-GPU system, with a small hardware overhead of 0.032%.
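The classification and coalescing ideas in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the paper describes: the abstract does not say how locality is measured, so the metric here (fraction of remote accesses that touch a recently seen cache line), the 128-byte line size, the 64-line tracking window, the 0.5 threshold, and the function names `line_reuse_ratio`, `classify`, and `coalesce` are all hypothetical.

```python
from collections import OrderedDict

# All three constants are illustrative assumptions, not values from the paper.
LINE_BYTES = 128      # assumed cache-line size in bytes
WINDOW_LINES = 64     # assumed number of recently touched lines that are tracked
THRESHOLD = 0.5       # assumed reuse ratio separating the two workload classes


def line_reuse_ratio(addresses, line_bytes=LINE_BYTES, window=WINDOW_LINES):
    """Return the fraction of accesses that touch a recently seen cache line."""
    recent = OrderedDict()  # line id -> None, ordered from oldest to newest
    reused = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in recent:
            reused += 1
            recent.move_to_end(line)        # refresh recency on reuse
        else:
            recent[line] = None
            if len(recent) > window:
                recent.popitem(last=False)  # drop the least recently used line
    return reused / len(addresses) if addresses else 0.0


def classify(addresses, threshold=THRESHOLD):
    """Label a remote-access trace by its measured spatio-temporal locality."""
    ratio = line_reuse_ratio(addresses)
    return "cache-friendly" if ratio >= threshold else "cache-insensitive"


def coalesce(requests, packet_bytes=LINE_BYTES):
    """Pack fine-grained (address, size) requests into fixed-capacity packets."""
    packets, current, used = [], [], 0
    for addr, size in requests:
        if current and used + size > packet_bytes:
            packets.append(current)         # packet is full, start a new one
            current, used = [], 0
        current.append((addr, size))
        used += size
    if current:
        packets.append(current)
    return packets
```

In this model, a trace of consecutive 4-byte accesses reuses each 128-byte line many times and is labeled cache-friendly, while a trace with a 4096-byte stride never revisits a line and is labeled cache-insensitive. For the latter, `coalesce` shows why fine-granularity transfer can save bandwidth: sixteen 8-byte requests fit in the space one full cache line would otherwise occupy.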
Pages: 304-316
Page count: 13