BATMAN: Techniques for Maximizing System Bandwidth of Memory Systems with Stacked-DRAM

Cited by: 28
Authors
Chou, Chiachen [1]
Jaleel, Aamer [2]
Qureshi, Moinuddin [1]
Affiliations
[1] Georgia Inst Technol, Sch ECE, Atlanta, GA 30332 USA
[2] NVIDIA, NVIDIA Res, Santa Clara, CA USA
Source
MEMSYS 2017: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON MEMORY SYSTEMS | 2017
Funding
National Science Foundation (USA);
Keywords
POLICIES;
DOI
10.1145/3132402.3132404
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Tiered-memory systems consist of high-bandwidth 3D-DRAM and high-capacity commodity-DRAM. Conventional designs attempt to improve system performance by maximizing the number of memory accesses serviced by 3D-DRAM. However, when the commodity-DRAM bandwidth is a significant fraction of overall system bandwidth, these techniques inefficiently utilize the total bandwidth offered by the tiered-memory system and yield suboptimal performance. In such situations, performance can be improved by distributing memory accesses in proportion to the bandwidth of each memory. Ideally, we want a simple and effective runtime mechanism that achieves the desired access distribution without requiring significant hardware or software support. This paper proposes Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that manages the distribution of memory accesses in a tiered-memory system by explicitly controlling data movement. BATMAN monitors the number of accesses to both memories, and when the number of 3D-DRAM accesses exceeds the desired threshold, BATMAN disallows data movement from commodity-DRAM to 3D-DRAM and proactively moves data from 3D-DRAM to commodity-DRAM. We demonstrate BATMAN on systems that architect the 3D-DRAM as either a hardware-managed cache (cache mode) or a part of the OS-visible memory space (flat mode). Our evaluations on a system with 4GB 3D-DRAM and 32GB commodity-DRAM show that BATMAN improves performance by an average of 11% and 10%, and energy-delay product by 13% and 11%, for systems in the cache and flat modes, respectively. BATMAN incurs only an eight-byte hardware overhead and requires negligible software modification.
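The control rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `target_stacked_fraction` and `AccessMonitor` are hypothetical, and the sketch assumes that the "desired threshold" is the 3D-DRAM share of total bandwidth, which follows from distributing accesses in proportion to bandwidth but is not spelled out in the abstract. It models only the decision logic (two counters and a comparison), not the data movement itself.

```python
from dataclasses import dataclass


def target_stacked_fraction(stacked_bw: float, commodity_bw: float) -> float:
    """Share of accesses 3D-DRAM should serve if accesses are split
    in proportion to each memory's bandwidth (assumed threshold)."""
    return stacked_bw / (stacked_bw + commodity_bw)


@dataclass
class AccessMonitor:
    """Hypothetical model of the access counters and the resulting policy."""
    target: float                 # desired fraction of accesses served by 3D-DRAM
    stacked_accesses: int = 0     # accesses serviced by 3D-DRAM
    commodity_accesses: int = 0   # accesses serviced by commodity-DRAM

    def record(self, served_by_stacked: bool) -> None:
        # Count one memory access against the memory that serviced it.
        if served_by_stacked:
            self.stacked_accesses += 1
        else:
            self.commodity_accesses += 1

    def stacked_fraction(self) -> float:
        total = self.stacked_accesses + self.commodity_accesses
        return self.stacked_accesses / total if total else 0.0

    def over_target(self) -> bool:
        return self.stacked_fraction() > self.target

    def allow_move_to_stacked(self) -> bool:
        # Above the threshold: disallow commodity-DRAM -> 3D-DRAM movement.
        return not self.over_target()

    def should_move_to_commodity(self) -> bool:
        # Above the threshold: proactively move data 3D-DRAM -> commodity-DRAM.
        return self.over_target()
```

For example, with an assumed 4:1 bandwidth ratio the target is 0.8; a monitor that has seen 9 of 10 accesses go to 3D-DRAM would block further movement into 3D-DRAM and request movement toward commodity-DRAM until the observed fraction falls back to the target.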
Pages: 268-280
Number of pages: 13