SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Cited: 22
Authors
Wu, Di [1]
Fan, Xitian [2]
Cao, Wei [3]
Wang, Lingli [3]
Affiliations
[1] Fudan University, State Key Laboratory of Application Specific Integrated Circuits, Shanghai 201203, People's Republic of China
[2] Fudan University, School of Computer Science, Shanghai 201203, People's Republic of China
[3] Fudan University, School of Microelectronics, Shanghai 201203, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; architecture
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification: TP3 (computing technology; computer technology)
Discipline Code: 0812
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity for reductions in both computation and memory. However, most cannot exploit the sparsity of activations and weights simultaneously, and those that do rely on a static scheduling (SS) strategy, which cannot maintain a stable load balance because it is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row (CSR) format and a dynamic scheduling strategy are proposed to improve the load balance, and a set-associative structure is presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference; it supports both sparse convolution and sparse fully connected (FC) layers, provides Winograd adaptability for large convolution kernels, and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. Evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, the architecture achieves up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization, and processes 310/725 images per second for VGG16/ResNet50. Compared with state-of-the-art works, our design achieves at least a 1.53x speedup and a 1.8x improvement in energy efficiency.
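Two techniques named in the abstract benefit from a concrete illustration. First, Winograd convolution: the sketch below is a minimal NumPy rendering of the standard F(2x2, 3x3) Winograd algorithm (the Lavin-Gray transform matrices) that sparse-Winograd designs such as SWM build on. It is not the authors' implementation; the function name and verification harness are purely illustrative. Each 2x2 output tile costs 16 element-wise multiplies instead of the 36 a direct 3x3 convolution needs, and zeros in the transformed kernel U can be skipped, which is where weight sparsity pays off in the Winograd domain.

import numpy as np

# Standard F(2x2, 3x3) transform matrices (Lavin & Gray, 2016).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One output tile: 2x2 outputs from a 4x4 input tile and a 3x3 kernel.

    Uses 16 element-wise multiplies instead of the 36 needed by direct
    3x3 convolution for the same 2x2 output patch.
    """
    U = G @ g @ G.T        # 4x4 transformed kernel (precomputable, prunable)
    V = B_T @ d @ B_T.T    # 4x4 transformed input tile
    M = U * V              # element-wise product; zeros in U can be skipped
    return A_T @ M @ A_T.T # inverse transform back to a 2x2 output tile

# Check one tile against direct (valid) 3x3 convolution/correlation.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)

Second, load balance under sparsity. The paper's balanced-CSR layout and dynamic scheduler are not detailed in the abstract, so the following toy sketch only illustrates the underlying problem: statically assigning whole CSR rows to processing elements (PEs) makes each PE's work proportional to its rows' nonzero counts, which vary with the sparsity distribution. The greedy longest-processing-time packing shown here is a hypothetical stand-in for rebalancing, and all names are illustrative.

import numpy as np

def csr_encode(dense):
    """Plain CSR: nonzero values, their column indices, and row pointers."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def balance_rows(rowptr, num_pes):
    """Greedy longest-processing-time assignment of rows to PEs by nnz."""
    nnz = np.diff(rowptr)
    buckets = [[] for _ in range(num_pes)]
    load = np.zeros(num_pes, dtype=int)
    for r in np.argsort(-nnz):        # heaviest rows first
        pe = int(np.argmin(load))     # send to the least-loaded PE
        buckets[pe].append(int(r))
        load[pe] += nnz[r]
    return buckets, load

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 64)) * (rng.random((16, 64)) < 0.2)  # ~80% sparse
_, _, rowptr = csr_encode(W)
buckets, load = balance_rows(rowptr, num_pes=4)
print("per-PE nonzero load:", load)  # flatter than a static row split

The point of the second sketch is only that per-PE load tracks nonzero counts rather than row counts; the paper's dynamic scheduling makes this assignment at run time instead of ahead of time.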
Pages: 936-949 (14 pages)