SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Citations: 22
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan Univ, State Key Lab Applicat Specif Integrated Circuit, Shanghai 201203, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 201203, Peoples R China
[3] Fudan Univ, Sch Microelect, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE;
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity for reductions in both computation and memory. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain a stable load balance, because their static scheduling (SS) strategies are vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. Evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, the architecture achieves up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
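The Winograd transformation that the abstract refers to can be illustrated with the minimal 1-D case F(2, 3), which produces two convolution outputs from a 4-element input tile and a 3-tap filter using only four multiplications instead of six. The sketch below is a generic pure-Python illustration of the standard Winograd algorithm, not the paper's hardware datapath or its sparse variant; the function names are chosen here for illustration only.

```python
def winograd_f23(d, g):
    """F(2, 3) Winograd convolution: 2 outputs from a 4-element
    input tile d and a 3-tap filter g, using 4 multiplications."""
    # Input transform: B^T d
    bt_d = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Filter transform: G g
    g_t = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Elementwise product in the transformed domain (the 4 multiplications)
    m = [bt_d[i] * g_t[i] for i in range(4)]
    # Output transform: A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def direct_conv3(d, g):
    """Reference: direct valid convolution (correlation form), 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


# The two methods agree on any tile, e.g.:
d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
assert winograd_f23(d, g) == direct_conv3(d, g)  # both give [6.0, 9.0]
```

In a sparse-Winograd accelerator such as the one described, zeros in the transformed activations or weights let the corresponding elementwise multiplications be skipped, which is where the combination of sparsity and the Winograd multiplication savings comes from.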
Pages: 936-949
Page count: 14