SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Times Cited: 23
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan Univ, State Key Lab Applicat Specif Integrated Circuit, Shanghai 201203, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 201203, Peoples R China
[3] Fudan Univ, Sch Microelect, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE;
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology];
Discipline Code
0812;
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thereby reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain stable load balance with a static scheduling (SS) strategy, which is sensitive to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 and achieves up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization on the Xilinx VCU1525 platform. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least a 1.53x speedup and a 1.8x energy efficiency improvement.
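The Winograd transformation referenced in the abstract maps small input tiles and kernels into a transform domain where convolution reduces to an element-wise product, which is what makes pruned (sparse) transformed kernels attractive. The following is a minimal Python sketch of the standard F(2x2, 3x3) Winograd algorithm using the widely used Lavin-Gray matrices; it illustrates the general technique only and is not the SWM hardware implementation or its quantized/sparse data path.

# Minimal sketch of F(2x2, 3x3) Winograd convolution (standard Lavin-Gray
# matrices), illustrative only; not the paper's accelerator data path.
import numpy as np

# Transformation matrices for F(2x2, 3x3)
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(tile, kernel):
    """2x2 output of a 4x4 input tile correlated with a 3x3 kernel."""
    U = G @ kernel @ G.T      # transformed kernel (4x4); this is what gets pruned
    V = B_T @ tile @ B_T.T    # transformed input tile (4x4)
    M = U * V                 # element-wise multiply in the Winograd domain
    return A_T @ M @ A_T.T    # inverse transform back to the 2x2 output

def direct_2x2(tile, kernel):
    """Reference: direct (valid) correlation producing the same 2x2 output."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(tile[i:i+3, j:j+3] * kernel)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.standard_normal((4, 4))   # input tile
    g = rng.standard_normal((3, 3))   # convolution kernel
    assert np.allclose(winograd_f2x2_3x3(d, g), direct_2x2(d, g))

The element-wise multiply replaces the 36 multiplications of direct 3x3 convolution over a 2x2 output with 16, and sparsity in the transformed kernel U removes multiplications further, which is the opportunity a sparse-Winograd accelerator exploits.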
Pages: 936-949
Page Count: 14