BATMAN: Techniques for Maximizing System Bandwidth of Memory Systems with Stacked-DRAM

Cited by: 25
Authors
Chou, Chiachen [1 ]
Jaleel, Aamer [2 ]
Qureshi, Moinuddin [1 ]
Affiliations
[1] Georgia Inst Technol, Sch ECE, Atlanta, GA 30332 USA
[2] NVIDIA, NVIDIA Res, Santa Clara, CA USA
Source
MEMSYS 2017: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON MEMORY SYSTEMS | 2017
Funding
National Science Foundation (USA);
Keywords
POLICIES;
DOI
10.1145/3132402.3132404
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Tiered-memory systems consist of high-bandwidth 3D-DRAM and high-capacity commodity-DRAM. Conventional designs attempt to improve system performance by maximizing the number of memory accesses serviced by 3D-DRAM. However, when the commodity-DRAM bandwidth is a significant fraction of overall system bandwidth, these techniques inefficiently utilize the total bandwidth offered by the tiered-memory system and yield suboptimal performance. In such situations, performance can be improved by distributing memory accesses in proportion to the bandwidth of each memory. Ideally, we want a simple and effective runtime mechanism that achieves the desired access distribution without requiring significant hardware or software support. This paper proposes Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that manages the distribution of memory accesses in a tiered-memory system by explicitly controlling data movement. BATMAN monitors the number of accesses to both memories, and when the number of 3D-DRAM accesses exceeds the desired threshold, BATMAN disallows data movement from commodity-DRAM to 3D-DRAM and proactively moves data from 3D-DRAM to commodity-DRAM. We demonstrate BATMAN on systems that architect the 3D-DRAM as either a hardware-managed cache (cache mode) or a part of the OS-visible memory space (flat mode). Our evaluations on a system with 4GB 3D-DRAM and 32GB commodity-DRAM show that BATMAN improves performance by an average of 11% and 10% and energy-delay product by 13% and 11% for systems in the cache and flat modes, respectively. BATMAN incurs only an eight-byte hardware overhead and requires negligible software modification.
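The threshold check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `target_access_fraction`, `AccessMonitor`, and the policy strings are hypothetical, and the 4:1 bandwidth ratio in the usage note is an assumed value, not a figure from the paper. The sketch only shows the idea of comparing the observed share of 3D-DRAM accesses against a bandwidth-proportional target.

```python
from dataclasses import dataclass


def target_access_fraction(bw_stacked: float, bw_commodity: float) -> float:
    """Share of accesses the 3D-DRAM should serve, proportional to its bandwidth."""
    return bw_stacked / (bw_stacked + bw_commodity)


@dataclass
class AccessMonitor:
    """Hypothetical per-interval access counters (illustrative only)."""
    target: float       # desired fraction of accesses served by 3D-DRAM
    stacked: int = 0    # accesses served by 3D-DRAM in this interval
    commodity: int = 0  # accesses served by commodity-DRAM in this interval

    def record(self, served_by_stacked: bool) -> None:
        if served_by_stacked:
            self.stacked += 1
        else:
            self.commodity += 1

    def stacked_fraction(self) -> float:
        total = self.stacked + self.commodity
        return self.stacked / total if total else 0.0

    def decide(self) -> str:
        """Pick a data-movement policy label (labels are assumptions)."""
        if self.stacked_fraction() > self.target:
            # Too many accesses hit 3D-DRAM: stop moving data into it
            # and move some data back to commodity-DRAM.
            return "block_fills_and_evict"
        return "allow_fills"
```

For example, with an assumed 4:1 bandwidth ratio the target fraction is 0.8. An interval in which 9 of 10 accesses hit the 3D-DRAM exceeds that target, so `decide()` returns `"block_fills_and_evict"`. An interval with 7 of 10 stays below it, so `decide()` returns `"allow_fills"`.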
Pages: 268-280
Number of pages: 13