Accelerating CNN Training With Concurrent Execution of GPU and Processing-in-Memory

Cited by: 0
Authors
Choi, Jungwoo [1 ]
Lee, Hyuk-Jae [1 ]
Sohn, Kyomin [2 ]
Yu, Hak-Soo [2 ]
Rhee, Chae Eun [3 ]
Affiliations
[1] Seoul Natl Univ, Interuniv Semicond Res Ctr ISRC, Dept Elect Engn & Comp Sci, Seoul 08826, South Korea
[2] Samsung Elect, Hwaseong Si 18448, South Korea
[3] Hanyang Univ, Dept Elect Engn, Seoul 04763, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Training; Graphics processing units; Convolutional neural networks; Pipelines; Electric breakdown; Bandwidth; Batch normalization; Switches; Scheduling algorithms; Random access memory; Processing-in-memory; convolutional neural networks; neural network training; GPU;
DOI
10.1109/ACCESS.2024.3488004
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Training convolutional neural networks (CNNs) consumes substantial time and resources. While most previous works have focused on accelerating the convolutional (CONV) layers, the share of training time spent in non-convolutional (non-CONV) layers, such as batch normalization, is steadily increasing. Non-CONV layers have low cache reuse and low arithmetic intensity, so their performance is limited by memory bandwidth. Processing-in-memory (PIM) can exploit wide internal memory bandwidth, making it well suited to accelerating non-CONV layers. It is therefore natural to execute the computationally intensive CONV layers on the host and to handle the memory-bound non-CONV layers on the PIM, and further gains can be expected if the two run concurrently. However, memory access conflicts between the host and the PIM are the main obstacle to such improvement. Prior studies proposed bank partitioning to alleviate memory conflicts, but it is ineffective here because CNN training involves substantial data sharing between CONV and non-CONV layers. In this paper, we propose a memory scheduling scheme and a CNN training flow for pipelined execution of CONV layers on the host and non-CONV layers on the PIM. First, instead of bank partitioning, the host and the PIM each access memory exclusively for a certain period, which avoids moving shared data between host memory and PIM memory. The conditions for switching memory access authority between the host and the PIM are set per layer, taking into account memory access characteristics and the number of queued memory requests. Second, in the training flow, CONV and non-CONV layers are pipelined at the granularity of output feature map channels. Specifically, in the backward pass, the non-CONV tasks of the feature map gradient calculation phase and the weight gradient update phase are rearranged so that they can be performed within CONV layers. Experimental results show that the proposed pipelined execution achieves an average network-level speedup of 18.1% over serial operation of the host and PIM.
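The two mechanisms summarized in the abstract lend themselves to a compact illustration. The Python sketch below is a minimal, hypothetical simulation of (1) exclusive memory access authority that is handed from host to PIM once enough non-CONV requests have queued, and (2) pipelining a CONV/non-CONV layer pair at output feature map channel granularity. Every name here (MemoryArbiter, QUEUE_THRESHOLD, run_conv_channel, run_nonconv_channel) and the specific threshold-based switching condition are assumptions made for illustration, not the authors' implementation.

# Hypothetical sketch: channel-granular pipelining of CONV (host) and
# non-CONV (PIM) work with exclusive, per-layer memory access authority.
# Names and the switching condition are illustrative assumptions only.

from collections import deque

QUEUE_THRESHOLD = 4  # assumed number of queued PIM requests before authority switches


class MemoryArbiter:
    """Grants memory access authority to exactly one side (host or PIM) at a time."""

    def __init__(self):
        self.owner = "host"
        self.pim_queue = deque()  # non-CONV channel work waiting for PIM authority

    def submit_to_pim(self, work):
        self.pim_queue.append(work)
        # Assumed switching condition: hand authority to the PIM once enough
        # requests have accumulated, so shared data never has to be copied.
        if len(self.pim_queue) >= QUEUE_THRESHOLD:
            self.switch_to("pim")

    def switch_to(self, side):
        if side == "pim":
            self.owner = "pim"
            while self.pim_queue:
                run_nonconv_channel(self.pim_queue.popleft())
            self.owner = "host"  # return authority to the host after draining


def run_conv_channel(layer, ch):
    print(f"[host] CONV     layer {layer}, channel {ch}")


def run_nonconv_channel(work):
    layer, ch = work
    print(f"[PIM ] non-CONV layer {layer}, channel {ch}")


def pipelined_layer(layer, num_channels, arbiter):
    """Pipeline one CONV/non-CONV layer pair at output-channel granularity."""
    for ch in range(num_channels):
        run_conv_channel(layer, ch)          # host computes output channel ch
        arbiter.submit_to_pim((layer, ch))   # PIM consumes it once it owns memory
    arbiter.switch_to("pim")                 # drain remaining channels at layer end


if __name__ == "__main__":
    arbiter = MemoryArbiter()
    for layer in range(2):
        pipelined_layer(layer, num_channels=6, arbiter=arbiter)

Deferring the authority switch until a batch of channel-granular requests has accumulated is one plausible way to keep shared activations in place while still overlapping host and PIM execution, which is the kind of overlap the abstract reports as an 18.1% average speedup.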
Pages: 160190-160204
Number of pages: 15