XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2

Authors:

Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]

Affiliations:

[1] AMD, 15F Block B China Overseas Int Ctr, Bldg 5 5 Yard, Beijing 100029, People's Republic of China

[2] PhiGent Robot, 25F, Tower B, Tsinghua Tongfang High Tech Plaza, 1 W, Beijing 100083, People's Republic of China

Source:

ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS | 2024 / Vol. 17 / Issue 02

Keywords:

ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; IMAGE SUPERRESOLUTION;

DOI:

10.1145/3617836

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends of higher accuracy and higher resolution generate larger networks. The requirements of computation or I/O are the key bottlenecks. In this article, we propose XVDPU: an AI Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques to improve data reuse and reduce I/O requirements. An arithmetic logic unit is further proposed that can better balance resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation can achieve 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation can further achieve 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary for high-definition CNNs with the accelerator, achieving 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. This accelerator can also solve the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.


Pages: 24

