SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Cited by: 23
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan University, State Key Laboratory of ASIC and System, Shanghai 201203, People's Republic of China
[2] Fudan University, School of Computer Science, Shanghai 201203, People's Republic of China
[3] Fudan University, School of Microelectronics, Shanghai 201203, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thereby reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain stable load balance, because they rely on a static scheduling (SS) strategy that is sensitive to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance, as sketched below. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, achieving up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM processes 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
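To make the load-balance argument concrete, the following is a minimal sketch contrasting static row-based scheduling with a balanced, nonzero-based partition of a CSR-compressed weight matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names (csr_compress, static_partition, balanced_partition), the PE count, and the synthetic sparsity pattern are all hypothetical, and the balanced split here ignores the row-boundary bookkeeping a hardware format would also need.

```python
import numpy as np

def csr_compress(w):
    """Compress a dense matrix into CSR arrays (values, column indices, row pointers)."""
    vals, cols, rowptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def static_partition(rowptr, num_pes):
    """Static scheduling (SS): give each PE the same number of ROWS.
    The per-PE work (nonzero count) then depends on the sparsity distribution."""
    rows = len(rowptr) - 1
    return [int(rowptr[(pe + 1) * rows // num_pes] - rowptr[pe * rows // num_pes])
            for pe in range(num_pes)]

def balanced_partition(rowptr, num_pes):
    """Balanced partition: split the NONZEROS as evenly as possible,
    so each PE receives roughly nnz/num_pes multiply-accumulate operations."""
    nnz = int(rowptr[-1])
    bounds = [round(pe * nnz / num_pes) for pe in range(num_pes + 1)]
    return [bounds[pe + 1] - bounds[pe] for pe in range(num_pes)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic weight matrix with per-row densities from 5% to 50%,
    # mimicking an unevenly pruned layer.
    density = rng.uniform(0.05, 0.5, size=(64, 1))
    w = rng.standard_normal((64, 64)) * (rng.random((64, 64)) < density)
    _, _, rowptr = csr_compress(w)
    print("static  per-PE loads:", static_partition(rowptr, 4))
    print("balanced per-PE loads:", balanced_partition(rowptr, 4))
```

With this skewed distribution, the static split leaves some PEs with several times the work of others, while the balanced split equalizes the nonzero counts; the paper's dynamic scheduling strategy additionally rebalances the assignment at run time rather than fixing it offline.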
Pages: 936-949
Page count: 14