RecPIM: Efficient In-Memory Processing for Personalized Recommendation Inference Using Near-Bank Architecture

Times Cited: 0
Authors
Yang, Weidong [1 ]
Yang, Yuqing [1 ]
Ji, Shuya [1 ]
Jiang, Jianfei [1 ]
Jing, Naifeng [1 ]
Wang, Qin [1 ]
Mao, Zhigang [1 ]
Sheng, Weiguang [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Micro Nano Elect, Shanghai 200240, Peoples R China
Keywords
Bandwidth; Through-silicon vias; Vectors; Computational modeling; Indexes; Random access memory; Programming; 3D-stacked memory; data reuse; mapping scheme; personalized recommendation; processing-in-memory; ACCELERATOR
DOI
10.1109/TCAD.2024.3386117
Chinese Library Classification
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
Deep learning (DL)-based personalized recommendation systems consume a major share of the resources in modern AI data centers. Their embedding layers, with large memory capacity requirements and high bandwidth demands, have been identified as the bottleneck of personalized recommendation inference. To mitigate the memory bandwidth bottleneck, near-memory processing (NMP) is an effective solution that exploits the through-silicon via (TSV) bandwidth within 3D-stacked DRAMs. However, existing NMP architectures suffer from limited memory bandwidth because TSVs are hard to scale. Integrating compute logic near the memory banks is a promising but challenging alternative: the large memory capacity requirement limits the use of 3D-stacked DRAMs, and irregular memory accesses lead to poor data locality, heavy TSV traffic, and low bank-level bandwidth utilization. To address these problems, we propose RecPIM, the first in-memory processing system for personalized recommendation inference that uses a near-bank architecture based on 3D-stacked memory. From the hardware perspective, we introduce a heterogeneous memory system combining 3D-stacked DRAM and DIMMs to accommodate large embedding tables and provide high bandwidth. By integrating processing logic units near the memory banks on the DRAM dies, the architecture exploits the enormous bank-level bandwidth, which is much higher than the TSV bandwidth. We further integrate a small scratchpad memory to exploit the unique data reusability of DL-based recommendation workloads, and adopt a unidirectional data communication scheme to avoid additional cross-vault data transfers. From the software perspective, we present a customized programming model that facilitates memory management and task offloading. To reduce data communication through TSVs and improve bank-level bandwidth utilization, we develop an efficient data mapping scheme that partitions each embedding vector into smaller subvectors. Experimental results show that RecPIM achieves up to 2.58x speedup and 49.8% energy savings for data movement over the state-of-the-art NMP solution.
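The embedding bottleneck the abstract describes is a gather-and-reduce pattern: sparse table lookups followed by pooling. The mapping scheme splits each embedding vector into subvectors so that every near-bank unit reduces only its own slice. The following Python sketch illustrates that idea at a functional level; the bank count, dimensions, and function names are illustrative assumptions, not RecPIM's actual design parameters or interface.

import numpy as np

# Hypothetical sketch of the embedding gather-and-reduce workload and the
# subvector mapping idea from the abstract. All sizes and names below are
# illustrative assumptions, not RecPIM's real parameters.

NUM_BANKS = 16                   # banks with near-bank processing units (assumed)
EMB_DIM = 64                     # embedding vector dimension (assumed)
SUB_DIM = EMB_DIM // NUM_BANKS   # each bank holds one subvector slice

rng = np.random.default_rng(0)
table = rng.standard_normal((1000, EMB_DIM)).astype(np.float32)

def near_bank_pooling(indices):
    # Column-wise partitioning: bank b stores columns [b*SUB_DIM, (b+1)*SUB_DIM).
    # One lookup batch then touches all banks in parallel, and each bank sums
    # only its own slice, so partial results never have to cross vaults.
    out = np.empty(EMB_DIM, dtype=np.float32)
    for b in range(NUM_BANKS):   # conceptually parallel across banks
        lo, hi = b * SUB_DIM, (b + 1) * SUB_DIM
        out[lo:hi] = table[indices, lo:hi].sum(axis=0)  # bank-local reduce
    return out

pooled = near_bank_pooling(rng.integers(0, 1000, size=40))
print(pooled.shape)  # (64,)

Partitioning column-wise keeps each partial sum local to a single bank, which matches the abstract's stated goals of avoiding cross-vault transfers and reducing TSV traffic; the actual mapping granularity and reduction hardware are detailed in the paper itself.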
Pages: 2854-2867
Number of Pages: 14