SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Cited: 22
Authors
Wu, Di [1]
Fan, Xitian [2]
Cao, Wei [3]
Wang, Lingli [3]
Affiliations
[1] Fudan University, State Key Laboratory of Application Specific Integrated Circuits, Shanghai 201203, People's Republic of China
[2] Fudan University, School of Computer Science, Shanghai 201203, People's Republic of China
[3] Fudan University, School of Microelectronics, Shanghai 201203, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; architecture
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification: TP3 (computing technology; computer technology)
Discipline Code: 0812
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity for reductions in both computation and memory. However, most cannot exploit the sparsity of activations and weights simultaneously, and those that do rely on a static scheduling (SS) strategy, which cannot maintain a stable load balance because it is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row (CSR) format and a dynamic scheduling strategy are proposed to improve the load balance, and a set-associative structure is presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference; it supports both sparse convolution and sparse fully connected (FC) layers, provides Winograd adaptability for large convolution kernels, and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. Evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, the architecture achieves up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization, and processes 310/725 images per second for VGG16/ResNet50. Compared with state-of-the-art works, our design achieves at least a 1.53x speedup and a 1.8x improvement in energy efficiency.
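Two techniques named in the abstract benefit from a concrete illustration. First, Winograd convolution: the sketch below is a minimal NumPy rendering of the standard F(2x2, 3x3) Winograd algorithm (the Lavin-Gray transform matrices) that sparse-Winograd designs such as SWM build on. It is not the authors' implementation; the function name and verification harness are purely illustrative. Each 2x2 output tile costs 16 element-wise multiplies instead of the 36 a direct 3x3 convolution needs, and zeros in the transformed kernel U can be skipped, which is where weight sparsity pays off in the Winograd domain.

import numpy as np

# Standard F(2x2, 3x3) transform matrices (Lavin & Gray, 2016).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One output tile: 2x2 outputs from a 4x4 input tile and a 3x3 kernel.

    Uses 16 element-wise multiplies instead of the 36 needed by direct
    3x3 convolution for the same 2x2 output patch.
    """
    U = G @ g @ G.T        # 4x4 transformed kernel (precomputable, prunable)
    V = B_T @ d @ B_T.T    # 4x4 transformed input tile
    M = U * V              # element-wise product; zeros in U can be skipped
    return A_T @ M @ A_T.T # inverse transform back to a 2x2 output tile

# Check one tile against direct (valid) 3x3 convolution/correlation.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)

Second, load balance under sparsity. The paper's balanced-CSR layout and dynamic scheduler are not detailed in the abstract, so the following toy sketch only illustrates the underlying problem: statically assigning whole CSR rows to processing elements (PEs) makes each PE's work proportional to its rows' nonzero counts, which vary with the sparsity distribution. The greedy longest-processing-time packing shown here is a hypothetical stand-in for rebalancing, and all names are illustrative.

import numpy as np

def csr_encode(dense):
    """Plain CSR: nonzero values, their column indices, and row pointers."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def balance_rows(rowptr, num_pes):
    """Greedy longest-processing-time assignment of rows to PEs by nnz."""
    nnz = np.diff(rowptr)
    buckets = [[] for _ in range(num_pes)]
    load = np.zeros(num_pes, dtype=int)
    for r in np.argsort(-nnz):        # heaviest rows first
        pe = int(np.argmin(load))     # send to the least-loaded PE
        buckets[pe].append(int(r))
        load[pe] += nnz[r]
    return buckets, load

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 64)) * (rng.random((16, 64)) < 0.2)  # ~80% sparse
_, _, rowptr = csr_encode(W)
buckets, load = balance_rows(rowptr, num_pes=4)
print("per-PE nonzero load:", load)  # flatter than a static row split

The point of the second sketch is only that per-PE load tracks nonzero counts rather than row counts; the paper's dynamic scheduling makes this assignment at run time instead of ahead of time.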
Pages: 936-949 (14 pages)