Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2
Authors
Belayneh, Leul [1]
Ye, Haojie [1]
Chen, Kuan-Yu [1]
Blaauw, David [1]
Mudge, Trevor [1]
Dreslinski, Ronald [1]
Talati, Nishil [1]
Affiliation
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Source
PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022 | 2022
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE
DOI
10.1145/3559009.3569649
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Generational gains from transistor scaling have allowed GPUs to accelerate traditional computation-intensive workloads. With the obsolescence of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly coupled multi-GPU systems. However, the Non-Uniform Memory Access (NUMA) bottleneck prevents multi-GPU systems from using their compute resources efficiently. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as either cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the conventional GPU cache subsystem does not exploit well. For cache-insensitive workloads, DualOpt transfers remote data at a fine granularity instead of the conventional cache-line granularity, and coalesces these transfers to use inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that exploits locality in remote accesses. Finally, a decision engine automatically identifies the workload class and applies the corresponding optimization, improving overall performance by 2.5x on a 4-GPU system with a small hardware overhead of 0.032%.
Pages: 304-316
Page count: 13
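
The abstract's distinction between cache-insensitive and cache-friendly workloads, and the bandwidth argument for fine-granularity transfers, can be illustrated with a small trace-based sketch. This is not DualOpt itself, which the abstract describes as a hardware-only mechanism. It is a software toy under stated assumptions: the 128-byte cache line, the 32-byte fine-grained unit, the 0.5 reuse threshold, and the function names `line_reuse_ratio`, `classify`, and `bytes_transferred` are all hypothetical, and the byte count assumes each distinct unit is fetched exactly once.

```python
# Toy illustration only: all sizes, thresholds, and names are assumptions,
# not values or mechanisms taken from the paper.
LINE_BYTES = 128    # assumed cache-line size
SECTOR_BYTES = 32   # assumed fine-granularity transfer unit


def line_reuse_ratio(addresses, line_bytes=LINE_BYTES):
    """Fraction of accesses that touch a cache line seen earlier in the trace."""
    seen = set()
    reused = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in seen:
            reused += 1
        else:
            seen.add(line)
    return reused / len(addresses) if addresses else 0.0


def classify(addresses, threshold=0.5):
    """Label a remote-access trace using a hypothetical reuse threshold."""
    if line_reuse_ratio(addresses) >= threshold:
        return "cache-friendly"
    return "cache-insensitive"


def bytes_transferred(addresses, unit_bytes):
    """Bytes moved if every distinct unit of unit_bytes is fetched exactly once."""
    return len({addr // unit_bytes for addr in addresses}) * unit_bytes


# Dense trace: consecutive 4-byte words, so many accesses share a line.
dense = [4 * i for i in range(256)]
# Sparse trace: one 4-byte word per 4 KiB, so no line is ever reused.
sparse = [4096 * i for i in range(256)]

for name, trace in (("dense", dense), ("sparse", sparse)):
    print(name, classify(trace),
          bytes_transferred(trace, LINE_BYTES),
          bytes_transferred(trace, SECTOR_BYTES))
```

In this sketch the sparse trace moves four times fewer bytes at the assumed 32-byte granularity than at the assumed 128-byte line granularity, while the dense trace moves the same number of bytes either way. That matches the intuition in the abstract for why low-locality workloads benefit from finer transfers.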