Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2
Authors
Belayneh, Leul [1 ]
Ye, Haojie [1 ]
Chen, Kuan-Yu [1 ]
Blaauw, David [1 ]
Mudge, Trevor [1 ]
Dreslinski, Ronald [1 ]
Talati, Nishil [1 ]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Source
PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022 | 2022
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE;
DOI
10.1145/3559009.3569649
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. But with the obsolescence of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly-coupled multi-GPU systems. However, the Non-Uniform Memory Access (NUMA) bottleneck prevents multi-GPU systems from efficiently utilizing their compute resources. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as either cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the conventional GPU cache subsystem does not exploit well. For cache-insensitive workloads, DualOpt transfers remote data at a fine granularity instead of the conventional cache-line granularity, and coalesces these transfers to utilize inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that can exploit locality in remote accesses. Finally, a decision engine automatically identifies the class of workload and delivers the corresponding optimization, which improves overall performance by 2.5x on a 4-GPU system, with a small hardware overhead of 0.032%.
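The classification step described in the abstract can be illustrated with a small software sketch. This is not the paper's hardware mechanism: the locality metric (the fraction of remote accesses that fall on a recently touched cache line), the 128-byte line size, the 64-line tracking window, the 0.5 threshold, and all function names are assumptions chosen only for illustration.

```python
from collections import OrderedDict

# All constants below are illustrative assumptions, not values from the paper.
LINE_BYTES = 128     # assumed cache-line size in bytes
WINDOW_LINES = 64    # assumed number of recently touched lines that are tracked
THRESHOLD = 0.5      # assumed locality cutoff between the two workload classes


def locality_score(addresses, line_bytes=LINE_BYTES, window_lines=WINDOW_LINES):
    """Return the fraction of accesses that touch a recently seen cache line.

    `addresses` is a list of remote byte addresses in access order. A line
    counts as "recent" if it is among the last `window_lines` distinct lines.
    """
    if not addresses:
        return 0.0
    recent = OrderedDict()  # line number -> True, ordered by recency of use
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in recent:
            hits += 1
            recent.move_to_end(line)        # refresh recency
        else:
            recent[line] = True
            if len(recent) > window_lines:
                recent.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


def classify(addresses, threshold=THRESHOLD):
    """Label a remote-access trace using the assumed locality threshold."""
    if locality_score(addresses) >= threshold:
        return "cache-friendly"
    return "cache-insensitive"


def choose_optimization(addresses):
    """Map the workload class to the optimization named in the abstract."""
    if classify(addresses) == "cache-friendly":
        return "remote-only cache"
    return "fine-granularity transfer with coalescing"


if __name__ == "__main__":
    dense = [i * 4 for i in range(1024)]      # consecutive 4-byte words
    sparse = [i * 4096 for i in range(1000)]  # one access per distinct line
    print(choose_optimization(dense))
    print(choose_optimization(sparse))
```

Under these assumptions, a trace of consecutive 4-byte words reuses each line many times and is labeled cache-friendly, while a trace that touches every line only once is labeled cache-insensitive. A real decision engine would presumably derive a comparable signal from hardware counters rather than from a recorded address trace.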
Pages: 304-316
Page count: 13
Related papers
50 in total
  • [41] Acoustic scattering solver based on single level FMM for multi-GPU systems
    Lopez-Portugues, Miguel
    Lopez-Fernandez, Jesus A.
    Menendez-Canal, Jonatan
    Rodriguez-Campa, Alberto
    Ranilla, Jose
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2012, 72 (09) : 1057 - 1064
  • [42] Heterogeneous Computational Model for Landform Attributes Representation on Multicore and Multi-GPU Systems
    Boratto, Murilo
    Alonso, Pedro
    Ramiro, Carla
    Barreto, Marcos
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, ICCS 2012, 2012, 9 : 47 - 56
  • [43] Improving the Performance of Cardiac Simulations in a Multi-GPU Architecture Using a Coalesced Data and Kernel Scheme
    Cordeiro, Raphael Pereira
    Oliveira, Rafael Sachetto
    dos Santos, Rodrigo Weber
    Lobosco, Marcelo
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2016, 2016, 10048 : 546 - 553
  • [44] Scaling up MapReduce-based Big Data Processing on Multi-GPU systems
    Jiang, Hai
    Chen, Yi
    Qiao, Zhi
    Weng, Tien-Hsiung
    Li, Kuan-Ching
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2015, 18 (01) : 369 - 383
  • [45] Scaling up MapReduce-based Big Data Processing on Multi-GPU systems
    Hai Jiang
    Yi Chen
    Zhi Qiao
    Tien-Hsiung Weng
    Kuan-Ching Li
    Cluster Computing, 2015, 18 : 369 - 383
  • [46] REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
    Ko, Gun
    Lee, Jiwon
    Kal, Hongju
    Lee, Hyunwuk
    Ro, Won Woo
    JOURNAL OF SYSTEMS ARCHITECTURE, 2025, 160
  • [47] Large scale water entry simulation with smoothed particle hydrodynamics on single- and multi-GPU systems
    Ji, Zhe
    Xu, Fei
    Takahashi, Akiyuki
    Sun, Yu
    COMPUTER PHYSICS COMMUNICATIONS, 2016, 209 : 1 - 12
  • [48] An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems
    Elliott, Glenn A.
    Anderson, James H.
    REAL-TIME SYSTEMS, 2013, 49 (02) : 140 - 170
  • [49] An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems
    Glenn A. Elliott
    James H. Anderson
    Real-Time Systems, 2013, 49 : 140 - 170
  • [50] A multi-GPU and CUDA-aware MPI-based spectral element formulation for ultrasonic wave propagation in solid media
    Li, Feilong
    Zou, Fangxin
    Rao, Jing
    ULTRASONICS, 2023, 134