SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Citations: 22
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan Univ, State Key Lab Applicat Specif Integrated Circuit, Shanghai 201203, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 201203, Peoples R China
[3] Fudan Univ, Sch Microelect, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE;
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity for reductions in both computation and memory. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain a stable load balance, because their static scheduling (SS) strategies are vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. Evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, the architecture achieves up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
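The Winograd transformation that the abstract refers to can be illustrated with the minimal 1-D case F(2, 3), which produces two convolution outputs from a 4-element input tile and a 3-tap filter using only four multiplications instead of six. The sketch below is a generic pure-Python illustration of the standard Winograd algorithm, not the paper's hardware datapath or its sparse variant; the function names are chosen here for illustration only.

```python
def winograd_f23(d, g):
    """F(2, 3) Winograd convolution: 2 outputs from a 4-element
    input tile d and a 3-tap filter g, using 4 multiplications."""
    # Input transform: B^T d
    bt_d = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Filter transform: G g
    g_t = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Elementwise product in the transformed domain (the 4 multiplications)
    m = [bt_d[i] * g_t[i] for i in range(4)]
    # Output transform: A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def direct_conv3(d, g):
    """Reference: direct valid convolution (correlation form), 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


# The two methods agree on any tile, e.g.:
d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
assert winograd_f23(d, g) == direct_conv3(d, g)  # both give [6.0, 9.0]
```

In a sparse-Winograd accelerator such as the one described, zeros in the transformed activations or weights let the corresponding elementwise multiplications be skipped, which is where the combination of sparsity and the Winograd multiplication savings comes from.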
Pages: 936-949
Page count: 14