Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory

Cited by: 1
Authors
Hong, Jeongmin [1]
Cho, Sungjun [1]
Park, Geonwoo [1]
Yang, Wonhyuk [1]
Gong, Young-Ho [2]
Kim, Gwangsun [1]
Affiliations
[1] POSTECH, Dept Comp Sci & Engn, Pohang Si, South Korea
[2] Soongsil Univ, Sch Software, Seoul, South Korea
Funding
National Research Foundation of Singapore;
Keywords
PHASE-CHANGE MEMORY; HIGH-PERFORMANCE; MAIN MEMORY; ARCHITECTURE; EFFICIENT; SYSTEM;
DOI
10.1109/HPCA57654.2024.00021
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than with HBM for workloads that mandate memory oversubscription, resulting in substantial speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth limitations of the SCM while minimizing cost overhead and considering the GPU's characteristics. Because the massive number of GPU threads can easily thrash the DRAM cache and degrade performance, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multidimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probe traffic and increase effective DRAM bandwidth with minimal cost overhead, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. The AMIL also retains full ECC protection, unlike prior DRAM cache implementations with the Tag-And-Data (TAD) organization. Additionally, we propose SCM throttling to curtail power consumption and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques can be used for different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, the HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
Pages: 139 - 155
Number of pages: 17
Related papers
31 items in total
  • [21] Dynamic Adaptive Replacement Policy in Shared Last-Level Cache of DRAM/PCM Hybrid Memory for Big Data Storage
    Jia, Gangyong
    Han, Guangjie
    Jiang, Jinfang
    Liu, Li
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2017, 13 (04) : 1951 - 1960
  • [22] FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory
    Oukid, Ismail
    Lasperas, Johan
    Nica, Anisoara
    Willhalm, Thomas
    Lehner, Wolfgang
    SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016 : 371 - 386
  • [23] Workload-Based Co-Design of Non-Volatile Cache Algorithm and Storage Class Memory Specifications for Storage Class Memory/NAND Flash Hybrid SSDs
    Yamada, Tomoaki
    Matsui, Chihiro
    Takeuchi, Ken
    IEICE TRANSACTIONS ON ELECTRONICS, 2017, E100C (04) : 373 - 381
  • [24] A Resistance-Drift Compensation Scheme to Reduce MLC PCM Raw BER by Over 100x for Storage-Class Memory Applications
    Khwa, Win-San
    Chang, Meng-Fan
    Wu, Jau-Yi
    Lee, Ming-Hsiu
    Su, Tzu-Hsiang
    Yang, Keng-Hao
    Chen, Tien-Fu
    Wang, Tien-Yen
    Li, Hsiang-Pang
    BrightSky, Matthew
    Kim, SangBum
    Lung, Hsiang-Lam
    Lam, Chung
    2016 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), 2016, 59 : 134 - U179
  • [25] Homogeneous barrier modulation of TaOx/TiO2 bilayers for ultra-high endurance three-dimensional storage-class memory
    Hsu, Chung-Wei
    Wang, Yu-Fen
    Wan, Chia-Chen
    Wang, I-Ting
    Chou, Chun-Tse
    Lai, Wei-Li
    Lee, Yao-Jen
    Hou, Tuo-Hung
    NANOTECHNOLOGY, 2014, 25 (16)
  • [26] Ferroelectric Field Effect Transistors-Based Content-Addressable Storage-Class Memory: A Study on the Impact of Device Variation and High-Temperature Compatibility
    Sunil, Athira
    Rana, S. K. Masud
    Lederer, Maximilian
    Raffel, Yannick
    Mueller, Franz
    Olivo, Ricardo
    Hoffmann, Raik
    Seidel, Konrad
    Kaempfe, Thomas
    Chakrabarti, Bhaswar
    De, Sourav
    ADVANCED INTELLIGENT SYSTEMS, 2024, 6 (04)
  • [27] An adaptive L2 cache prefetching mechanism for effective exploitation of abundant memory bandwidth of 3-D IC technology
    Lim, Hong-Yeol
    Park, Gi-Ho
    IEICE ELECTRONICS EXPRESS, 2013, 10 (16)
  • [28] 3D AND: A 3D Stackable Flash Memory Architecture to Realize High-Density and Fast-Read 3D NOR Flash and Storage-Class Memory
    Lue, Hang-Ting
    Lee, Guan-Ru
    Yeh, Teng-Hao
    Hsu, Tzu-Hsuan
    Lo, Chieh Roger
    Sung, Cheng-Lin
    Chen, Wei-Chen
    Huang, Chia-Tze
    Shen, Kuan-Yuan
    Wu, Meng-Yen
    Tseng, Pishan
    Hung, Min-Feng
    Chiu, Chia-Jung
    Hsieh, Kuang-Yeu
    Wang, Keh-Chung
    Lu, Chih-Yuan
    2020 IEEE INTERNATIONAL ELECTRON DEVICES MEETING (IEDM), 2020
  • [29] A 1.2 V 8 Gb 8-Channel 128 GB/s High-Bandwidth Memory (HBM) Stacked DRAM With Effective I/O Test Circuits
    Lee, Dong Uk
    Kim, Kyung Whan
    Kim, Kwan Weon
    Lee, Kang Seol
    Byeon, Sang Jin
    Kim, Jae Hwan
    Cho, Jin Hee
    Lee, Jaejin
    Chun, Jun Hyun
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2015, 50 (01) : 191 - 203
  • [30] A 10Mbit, 15GBytes/sec bandwidth 1T DRAM chip with planar MOS storage capacitor in an unmodified 150nm logic process for high-density on-chip memory applications
    Somasekhar, D
    Lu, SL
    Bloechel, B
    Dermer, G
    Lai, K
    Borkar, S
    De, V
    ESSCIRC 2005: PROCEEDINGS OF THE 31ST EUROPEAN SOLID-STATE CIRCUITS CONFERENCE, 2005 : 355 - 358