Research and Optimization of the Winograd-Based Convolutional Algorithm on ShenWei-26010 Many-Core Processor

Cited by: 0
Authors
Wu Z. [1 ]
Jin X. [1 ]
An H. [1 ]
Affiliations
[1] School of Computer Science and Technology, University of Science and Technology of China, Hefei
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2024 / Vol. 61 / No. 4
Keywords
deep learning; high-performance computing; parallel algorithm; ShenWei processor; Winograd-based convolution;
DOI
10.7544/issn1000-1239.202220787
Abstract
As a critical component, convolution is applied pervasively in deep learning, and parallel convolutional algorithms have long been a popular research topic in high-performance computing. With the rapid adoption of the Chinese homegrown ShenWei-26010 many-core processor in artificial intelligence, there is an urgent demand for high-performance convolutional algorithms on this processor. We propose an efficient convolutional design, the fused Winograd-based convolutional algorithm, tailored to the architectural characteristics of ShenWei-26010 and the computational features of Winograd-based convolution. Unlike the traditional Winograd-based convolutional algorithm, which depends on the official GEMM (general matrix multiplication) library interface, the proposed algorithm uses a customized matrix multiplication implementation. This makes the execution process of the algorithm visible, allowing it to adapt better to the convolutions commonly encountered in practice. The proposed algorithm consists of four parts: input Winograd transformation, filter Winograd transformation, the core operation, and output Winograd inverse transformation. The four parts are fused together instead of being executed separately: the core operation obtains the required transformed data in real time, and the computational results are immediately inverse-transformed into the final output. This fused execution improves the data locality of the algorithm and thus significantly reduces memory access overhead. Moreover, we design further optimizations to enhance performance, such as a merged Winograd-transform mode, DMA (direct memory access) double buffering, enhanced usage of on-chip storage, elastic processing of output data tiles, and instruction reordering. Experiments show that the performance of the proposed algorithm is 7.8 times that of the traditional Winograd-based convolutional algorithm on the VGG network model.
Moreover, we extract the common convolutions from multiple typical convolutional neural networks to measure hardware efficiency. The results show that the proposed algorithm significantly outperforms the traditional Winograd-based convolutional algorithm on all the convolution cases. The best performance of the proposed algorithm is 116.21% of the theoretical peak performance of the ShenWei-26010 processor, and the average reaches 93.14%. © 2024 Science Press. All rights reserved.
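The four-stage pipeline the abstract describes (input transform, filter transform, elementwise core operation, inverse transform) can be illustrated for a single output tile. The sketch below uses the standard F(2×2, 3×3) transform matrices from Lavin and Gray's fast-algorithms formulation and NumPy for clarity; it is an illustrative assumption about the underlying mathematics only, not the paper's actual fused SW26010 implementation.

```python
import numpy as np

# F(2x2, 3x3) Winograd transform matrices (Lavin & Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_tile(d, g):
    """One F(2x2,3x3) tile: 4x4 input tile d, 3x3 filter g -> 2x2 output.

    The stages mirror the algorithm's four parts; in the fused design
    they would run back-to-back on-chip rather than as separate passes.
    """
    U = G @ g @ G.T        # filter Winograd transformation (4x4)
    V = B_T @ d @ B_T.T    # input Winograd transformation (4x4)
    M = U * V              # core operation: elementwise product
    return A_T @ M @ A_T.T # output Winograd inverse transformation (2x2)

def direct_conv2d_valid(d, g):
    """Reference: direct 'valid' correlation, as CNN convolution is defined."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_tile(d, g), direct_conv2d_valid(d, g))
```

The elementwise product replaces the 9 multiplications per output point of direct convolution with 16 multiplications per 2×2 tile (4 per point), which is the arithmetic saving that Winograd-based convolution exploits.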
Pages: 955-972
Page count: 17