BATMAN: Techniques for Maximizing System Bandwidth of Memory Systems with Stacked-DRAM

Cited by: 25
Authors
Chou, Chiachen [1]
Jaleel, Aamer [2]
Qureshi, Moinuddin [1]
Affiliations
[1] Georgia Inst Technol, Sch ECE, Atlanta, GA 30332 USA
[2] NVIDIA, NVIDIA Res, Santa Clara, CA USA
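Source
MEMSYS 2017: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON MEMORY SYSTEMS | 2017
Funding
National Science Foundation (USA);
Keywords
POLICIES;
DOI
10.1145/3132402.3132404
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract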
Tiered-memory systems consist of high-bandwidth 3D-DRAM and high-capacity commodity-DRAM. Conventional designs attempt to improve system performance by maximizing the number of memory accesses serviced by 3D-DRAM. However, when the commodity-DRAM bandwidth is a significant fraction of overall system bandwidth, these techniques inefficiently utilize the total bandwidth offered by the tiered-memory system and yield suboptimal performance. In such situations, performance can be improved by distributing memory accesses in proportion to the bandwidth of each memory. Ideally, we want a simple and effective runtime mechanism that achieves the desired access distribution without requiring significant hardware or software support. This paper proposes Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that manages the distribution of memory accesses in a tiered-memory system by explicitly controlling data movement. BATMAN monitors the number of accesses to both memories, and when the number of 3D-DRAM accesses exceeds the desired threshold, BATMAN disallows data movement from the commodity-DRAM to 3D-DRAM and proactively moves data from 3D-DRAM to commodity-DRAM. We demonstrate BATMAN on systems that architect the 3D-DRAM as either a hardware-managed cache (cache mode) or a part of the OS-visible memory space (flat mode). Our evaluations on a system with 4GB 3D-DRAM and 32GB commodity-DRAM show that BATMAN improves performance by an average of 11% and 10% and energy-delay product by 13% and 11% for systems in the cache and flat modes, respectively. BATMAN incurs only an eight-byte hardware overhead and requires negligible software modification.
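The control decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names are hypothetical, the 4:1 bandwidth ratio in the usage note is an assumed example, and setting the threshold to the 3D-DRAM share of total bandwidth is one reading of "in proportion to the bandwidth of each memory".

```python
from dataclasses import dataclass


def target_fraction(bw_stacked: float, bw_commodity: float) -> float:
    """Share of accesses 3D-DRAM would serve if accesses were split
    in proportion to bandwidth (assumed reading of the abstract)."""
    return bw_stacked / (bw_stacked + bw_commodity)


@dataclass
class AccessMonitor:
    """Hypothetical access monitor: counts accesses to each memory tier."""
    target: float               # desired 3D-DRAM share of all accesses
    stacked_accesses: int = 0   # accesses served by 3D-DRAM
    commodity_accesses: int = 0 # accesses served by commodity-DRAM

    def record(self, to_stacked: bool) -> None:
        """Count one memory access against the tier that served it."""
        if to_stacked:
            self.stacked_accesses += 1
        else:
            self.commodity_accesses += 1

    def stacked_fraction(self) -> float:
        """Observed share of accesses served by 3D-DRAM so far."""
        total = self.stacked_accesses + self.commodity_accesses
        return self.stacked_accesses / total if total else 0.0

    def over_target(self) -> bool:
        """True when 3D-DRAM serves more than its desired share."""
        return self.stacked_fraction() > self.target

    def allow_move_to_stacked(self) -> bool:
        """Commodity-to-3D data movement is disallowed while over target."""
        return not self.over_target()

    def should_move_to_commodity(self) -> bool:
        """3D-to-commodity data movement is triggered while over target."""
        return self.over_target()
```

For example, with an assumed bandwidth ratio of 4:1, `target_fraction(4.0, 1.0)` gives a target of about 0.8. A monitor that has seen nine 3D-DRAM accesses and one commodity-DRAM access is over that target, so it would block movement into 3D-DRAM and trigger movement out of it until the observed share falls back to the target or below.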
Pages: 268-280
Number of pages: 13