FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA

Cited by: 24
Authors
Basalama, Suhail [1 ]
Sohrabizadeh, Atefeh [1 ]
Wang, Jie [1 ]
Guo, Licheng [1 ]
Cong, Jason [1 ]
Affiliation
[1] Univ Calif Los Angeles, 404 Westwood Blvd, Engn 6 Room 468, Los Angeles, CA 90095 USA
Keywords
FPGA; CNN; ONNX; systolic array; transposed convolution; dilated convolution; OpenPose; U-Net; E-Net
DOI
10.1145/3570928
CLC Number (Chinese Library Classification)
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations their efficiency drops dramatically for three reasons: (1) the varying dimensions within same-type layers, (2) the different types of convolution layers, especially transposed and dilated convolutions, and (3) the CNN's complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency through dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA that processes normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN described in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3x performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups of up to 15.98x and 13.42x for transposed and dilated convolutions, respectively, with a 6% average area overhead. The pipelined software-hardware integration leads to a 5x speedup for OpenPose.
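To make the front end of such a compilation flow concrete, the sketch below walks an ONNX graph and classifies each convolution node as normal, transposed, or dilated, the three cases the versatile SA targets. This is a minimal Python illustration using the standard onnx package, not FlexCNN's actual compiler; the function name classify_convolutions and the model file name openpose.onnx are hypothetical.

# Minimal sketch (not FlexCNN's compiler): classify ONNX convolution nodes
# as normal, transposed, or dilated. Assumes the `onnx` Python package and
# a hypothetical model file "openpose.onnx".
import onnx

def classify_convolutions(onnx_path):
    model = onnx.load(onnx_path)
    kinds = {}
    for node in model.graph.node:
        if node.op_type == "ConvTranspose":
            kinds[node.name] = "transposed"
        elif node.op_type == "Conv":
            # The "dilations" attribute lists the dilation factor per spatial axis.
            dilations = next((list(a.ints) for a in node.attribute
                              if a.name == "dilations"), None)
            if dilations and any(d > 1 for d in dilations):
                kinds[node.name] = "dilated"
            else:
                kinds[node.name] = "normal"
    return kinds

if __name__ == "__main__":
    for name, kind in classify_convolutions("openpose.onnx").items():
        print(name, "->", kind)

In a flow like FlexCNN's, this kind of per-node information would feed design space exploration and hardware generation; here it only prints the classification.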
Pages: 32