A Stride-Based Convolution Decomposition Method to Stretch CNN Acceleration Algorithms for Efficient and Flexible Hardware Implementation

Cited by: 29
Authors
Yang, Chen [1 ]
Wang, Yizhou [1 ]
Wang, Xiaoli [1 ]
Geng, Li [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Microelect, Xian 710049, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Convolutional neural networks; acceleration algorithm; convolution decomposition; flexibility; hardware-efficient; NEURAL-NETWORK PROCESSOR;
DOI
10.1109/TCSI.2020.2985727
CLC number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
To reduce multiplication operations in the convolutions of convolutional neural networks (CNNs), three convolutional acceleration algorithms are widely used: Winograd, FFT, and FFA. However, current accelerators based on these algorithms have issues with flexibility and efficiency. Firstly, some accelerators combine several of these acceleration algorithms and employ multiple types of computational units to obtain their respective advantages. As a result, some computational units sit idle while the best-performing unit is working, which causes considerable area inefficiency. Secondly, current accelerators tend to choose small parameters for these acceleration algorithms to avoid unacceptable precision loss; consequently, they can hardly support large kernel sizes and lack flexibility. Thirdly, these acceleration algorithms are typically presented for 1-stride convolutions, so few implementations consider the acceleration of large-stride convolutions, which is a major restriction on hardware flexibility. This paper proposes a stride-based convolution decomposition method (SCDM) to reform different convolution shapes (i.e., kernel sizes and strides) into an identical pattern. With the aid of SCDM, a Winograd-stretched and hardware-efficient design (WHD) is presented that uses one uniform computational unit to accelerate different convolution shapes, combining the complementary performance advantages of Winograd F(4, 3) and F(4, 2) units. Compared to current FFT-based or FFA-based works, WHD stretches the applicable range of Winograd and simplifies implementation, thereby achieving hardware flexibility and efficiency. Evaluation results show that an operation reduction of 34.08% to 55.41% was achieved on six CNN models, while incurring only a slight hardware overhead.
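The core idea sketched in the abstract — rewriting a large-stride convolution as an equivalent set of 1-stride convolutions over smaller sub-kernels — can be illustrated with a standard polyphase decomposition. The sketch below is an illustration of that general principle, not the paper's actual SCDM: the function names are mine, and the paper's method additionally maps the resulting sub-kernels onto F(4, 3) and F(4, 2) Winograd units, which is not shown here. A stride-s convolution with a k×k kernel splits into s×s stride-1 convolutions; e.g., a stride-2 3×3 kernel splits into 2×2, 2×1, 1×2, and 1×1 sub-kernels.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Direct 2-D convolution (cross-correlation), valid padding."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh,
                                 j*stride:j*stride+kw] * w)
    return out

def decomposed_strided_conv(x, w, stride):
    """Rewrite a stride-`stride` convolution as a sum of stride-1
    convolutions over polyphase slices of the input and kernel."""
    oh = (x.shape[0] - w.shape[0]) // stride + 1
    ow = (x.shape[1] - w.shape[1]) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(stride):
        for s in range(stride):
            ws = w[r::stride, s::stride]   # kernel taps of phase (r, s)
            if ws.size == 0:
                continue                   # no taps fall in this phase
            xs = x[r::stride, s::stride]   # matching input phase
            # each phase is a plain stride-1 convolution; crop trailing
            # rows/cols so all phases align with the direct output
            out += conv2d(xs, ws, stride=1)[:oh, :ow]
    return out

# sanity check: the decomposition matches the direct stride-2 result
rng = np.random.default_rng(0)
x = rng.random((8, 8))
w = rng.random((3, 3))
assert np.allclose(conv2d(x, w, stride=2),
                   decomposed_strided_conv(x, w, stride=2))
```

Because every phase is now a 1-stride convolution with a small kernel, a single uniform Winograd unit can in principle serve all of them, which is the flexibility argument the abstract makes.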
Pages: 3007-3020 (14 pages)