Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology

Cited by: 133
Authors
Lee, Sukhan [1 ]
Kang, Shin-haeng [1 ]
Lee, Jaehoon [1 ]
Kim, Hyeonsu [2 ]
Lee, Eojin [1 ]
Seo, Seungwoo [2 ]
Yoon, Hosang [2 ]
Lee, Seungwon [2 ]
Lim, Kyounghwan [1 ]
Shin, Hyunsung [1 ]
Kim, Jinhyun [1 ]
Seongil, O. [1 ]
Iyer, Anand [3 ]
Wang, David [3 ]
Sohn, Kyomin [1 ]
Kim, Nam Sung [1 ]
Affiliations
[1] Samsung Electronics, Memory Business Division, Suwon, South Korea
[2] Samsung Electronics, Samsung Advanced Institute of Technology, Suwon, South Korea
[3] Samsung Electronics, Device Solutions America, Suwon, South Korea
Source
2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021) | 2021
Keywords
processing in memory; neural network; accelerator; DRAM;
DOI
10.1109/ISCA52012.2021.00013
CLC classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Emerging applications such as deep neural networks demand high off-chip memory bandwidth. However, under the stringent physical constraints of chip packages and system boards, further increasing off-chip memory bandwidth has become very expensive. Moreover, transferring data across the memory hierarchy constitutes a large fraction of the total energy consumption of systems, and this fraction has steadily increased with stagnant technology scaling and the poor data-reuse characteristics of such emerging applications. To cost-effectively increase bandwidth and energy efficiency, researchers have begun to revisit past processing-in-memory (PIM) architectures and advance them further, especially by exploiting recent integration technologies such as 2.5D/3D stacking. Despite these recent advances, no major memory manufacturer has developed even a proof-of-concept silicon, let alone a product. This is because past PIM architectures often require changes in host processors and/or application code that memory manufacturers cannot easily govern. In this paper, elegantly tackling the aforementioned challenges, we propose an innovative yet practical PIM architecture. To demonstrate its practicality and effectiveness at the system level, we implement it in a 20nm DRAM technology, integrate it with an unmodified commercial processor, develop the necessary software stack, and run existing applications without changing their source code. Our system-level evaluation shows that our PIM improves the performance of memory-bound neural network kernels and applications by 11.2x and 3.5x, respectively. Atop the performance improvement, PIM also reduces the energy per bit transfer by 3.5x and improves the overall energy efficiency of the system running the applications by 3.2x.
Pages: 43-56
Number of pages: 14