ADC-PIM: Accelerating Convolution on the GPU via In-Memory Approximate Data Comparison

Cited by: 3
Authors
Choi, Jungwoo [1]
Lee, Hyuk-Jae [1]
Rhee, Chae Eun [2]
Affiliations
[1] Seoul Natl Univ, Interuniv Semicond Res Ctr ISRC, Dept Elect Engn & Comp Sci, Seoul 08826, South Korea
[2] Inha Univ, Dept Informat & Commun Engn, Incheon 22212, South Korea
Keywords
Convolution; Graphics processing units; Approximate computing; Bandwidth; Through-silicon vias; Table lookup; Random access memory; Processing-in-memory; Convolutional neural networks; GPU; Neural network; DRAM
DOI
10.1109/JETCAS.2022.3167391
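Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract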
Convolutional neural networks (CNNs) are widely used in image processing and computer vision. GPUs are often used to accelerate CNNs, but performance is limited by the high computational cost and memory usage of convolution. Prior studies exploited approximate computing to reduce the computational cost. However, they reduced only the amount of computation, so performance remains bottlenecked by memory bandwidth because of the increased memory intensity. In addition, load imbalance between warps caused by approximation inhibits performance improvement. In this paper, we propose ADC-PIM, a processing-in-memory (PIM) solution that reduces both data movement and computation through Approximate Data Comparison. Instead of determining value similarity after loading the data into the GPU, the ADC-PIM unit located on 3D-stacked memory compares the data for similarity and transfers only the selected representative data to the GPU. The GPU performs convolution on the representative data transferred from the ADC-PIM and reuses the calculated results based on the similarity information. To limit the increase in memory latency caused by the in-memory data comparison, we propose a two-level PIM architecture that exploits both the DRAM bank and the TSV stage. By dividing the comparisons across multiple banks and then merging the results at the TSV stage, the ADC-PIM effectively hides the delay caused by the comparisons. To ease load balancing on the GPU, the ADC-PIM reorganizes the data by assigning the representative data to addresses computed from the comparison result. Experimental results show that the proposed ADC-PIM provides a 43% speedup and a 32% energy saving with less than a 1% accuracy drop.
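The compare-then-reuse idea in the abstract can be illustrated with a minimal software sketch. This is not the authors' implementation: the abstract does not state the similarity metric, the threshold, or the data layout, so the element-wise maximum-difference test, the `threshold` parameter, and the function names `approx_compare` and `conv_with_reuse` are all assumptions made for illustration. The first function stands in for the in-memory comparison that selects representative data, and the second stands in for the GPU side, which computes only on the representatives and reuses those results.

```python
def approx_compare(patches, threshold):
    """Hypothetical stand-in for the in-memory comparison step.

    Returns the representative patches and, for every input patch,
    the index of the representative it is mapped to.
    Assumption: two patches are "similar" when every element differs
    by at most `threshold`; the paper's actual criterion may differ.
    """
    representatives = []
    index_map = []
    for patch in patches:
        for rep_id, rep in enumerate(representatives):
            if max(abs(a - b) for a, b in zip(patch, rep)) <= threshold:
                index_map.append(rep_id)  # similar: reuse this representative
                break
        else:
            # No similar representative found: this patch becomes a new one.
            representatives.append(patch)
            index_map.append(len(representatives) - 1)
    return representatives, index_map


def conv_with_reuse(patches, kernel, threshold):
    """Hypothetical stand-in for the GPU side.

    Computes the dot product with the kernel only for representatives,
    then reuses each result for every patch mapped to that representative.
    """
    reps, index_map = approx_compare(patches, threshold)
    rep_results = [sum(p * k for p, k in zip(rep, kernel)) for rep in reps]
    outputs = [rep_results[i] for i in index_map]
    return outputs, len(reps)


if __name__ == "__main__":
    patches = [
        [1.0, 2.0, 3.0],
        [1.1, 2.0, 2.9],  # close to the first patch
        [5.0, 5.0, 5.0],
        [1.0, 2.0, 3.0],  # identical to the first patch
    ]
    kernel = [1.0, 0.0, -1.0]
    outputs, computed = conv_with_reuse(patches, kernel, threshold=0.2)
    print(outputs, computed)
```

In this toy input, only two of the four patches are actually multiplied with the kernel. The second patch receives the first patch's result rather than its own exact value, which is the kind of small approximation error the abstract trades for less data movement and computation.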
Pages: 458-471
Number of pages: 14