SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Cited by: 22
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan Univ, State Key Lab Applicat Specif Integrated Circuit, Shanghai 201203, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 201203, Peoples R China
[3] Fudan Univ, Sch Microelect, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE;
DOI
10.1109/TVLSI.2021.3060041
CLC Classification Number
TP3 [Computing technology; computer technology];
Discipline Code
0812;
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thereby reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain stable load balance under a static scheduling (SS) strategy, which is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, achieving up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least a 1.53x speedup and a 1.8x improvement in energy efficiency.
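To make the "sparse-Winograd" idea in the abstract concrete, the sketch below shows the standard F(2x2, 3x3) Winograd transform (Lavin-Gray), in which convolution becomes an element-wise product in the transform domain; zeros in the transformed (pruned) kernel are where a sparse accelerator can skip multiplications. This is only a minimal NumPy illustration under stated assumptions: it does not reproduce the paper's balanced CSR format, dynamic scheduling, or set-associative structure, and balance_nonzeros is a hypothetical round-robin split added purely to illustrate the load-balancing notion.

# Illustrative sketch only; not the SWM design from the paper.
import numpy as np

# F(2x2, 3x3) Winograd transform matrices.
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
Bt = np.array([[1.0, 0.0, -1.0, 0.0],
               [0.0, 1.0, 1.0, 0.0],
               [0.0, -1.0, 1.0, 0.0],
               [0.0, 1.0, 0.0, -1.0]])
At = np.array([[1.0, 1.0, 1.0, 0.0],
               [0.0, 1.0, -1.0, -1.0]])

def winograd_f2x2_3x3(tile, kernel):
    """Convolve one 4x4 input tile with a 3x3 kernel, giving a 2x2 output.

    The element-wise product U * V is where zeros in the pruned,
    transformed kernel U allow multiplications to be skipped in hardware.
    """
    U = G @ kernel @ G.T   # 4x4 transformed kernel (may be sparse after pruning)
    V = Bt @ tile @ Bt.T   # 4x4 transformed input tile
    M = U * V              # element-wise multiply in the Winograd domain
    return At @ M @ At.T   # inverse transform to the 2x2 output tile

def balance_nonzeros(U, num_pes):
    """Hypothetical greedy split of the transformed kernel's nonzeros across
    processing elements so each PE gets a similar multiply count."""
    coords = list(zip(*np.nonzero(U)))
    buckets = [[] for _ in range(num_pes)]
    for i, c in enumerate(coords):          # round-robin over nonzero positions
        buckets[i % num_pes].append(c)
    return buckets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.standard_normal((4, 4))
    g = rng.standard_normal((3, 3))
    g[np.abs(g) < 0.5] = 0                  # mimic a pruned (sparse) 3x3 kernel
    y = winograd_f2x2_3x3(d, g)

    # Reference: direct 'valid' 3x3 correlation over the same tile.
    ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                    for i in range(2)])
    assert np.allclose(y, ref)
    print(balance_nonzeros(G @ g @ G.T, num_pes=4))

The assertion confirms that the transform reproduces direct convolution on a single tile; in an accelerator the same element-wise stage is what a compressed sparse representation and scheduling strategy operate on.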
Pages: 936-949
Page count: 14