XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2

Authors:

Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]

Affiliations:

[1] AMD, 15F Block B China Overseas Int Ctr,Bldg 5 5 Yard, Beijing 100029, Peoples R China

[2] PhiGent Robot, 25F,Tower B,Tsinghua Tongfang High Tech Plaza,1 W, Beijing 100083, Peoples R China

Source:

ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS | 2024 / Vol. 17 / Issue 02

Keywords:

ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; IMAGE SUPERRESOLUTION;

DOI:

10.1145/3617836

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends toward higher accuracy and higher resolution produce larger networks, and computation and I/O requirements become the key bottlenecks. In this article, we propose XVDPU: an AI Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques to improve data reuse and reduce I/O requirements. An arithmetic logic unit is further proposed that can better balance resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation can achieve 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation can further achieve 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary for high-definition CNN with the accelerator, achieving 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. This accelerator can also solve the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.
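
The throughput ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and it assumes the 4,050 FPS figure for the 256-AIE-core implementation refers to the same ResNet50 workload as the other two numbers, which the abstract implies but does not state outright.

```python
# FPS figures as quoted in the abstract (ResNet50).
FPS_ZCU102 = 168.5       # earlier design on ZCU102
FPS_AIE_96 = 1653.0      # 96-AIE-core implementation on VCK190
FPS_AIE_256 = 4050.0     # 256-AIE-core implementation (assumed same workload)


def speedup(new_fps: float, base_fps: float) -> float:
    """Return how many times faster new_fps is than base_fps."""
    return new_fps / base_fps


# 96-core design versus the ZCU102 baseline (the abstract reports 9.8x).
speedup_96 = speedup(FPS_AIE_96, FPS_ZCU102)

# 256-core design versus the 96-core design, next to the ratio of core counts.
speedup_256_vs_96 = speedup(FPS_AIE_256, FPS_AIE_96)
core_ratio = 256 / 96

print(f"96-core vs ZCU102:   {speedup_96:.2f}x")
print(f"256-core vs 96-core: {speedup_256_vs_96:.2f}x (core ratio {core_ratio:.2f}x)")
```

Under these assumptions, the first ratio rounds to the 9.8x stated in the abstract, and the gain from 96 to 256 cores comes out somewhat below the increase in core count.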


Pages: 24

50 items in total

[1] XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine
Jia, Xijie
Zhang, Yu
Liu, Guangdong
Yang, Xinlin
Zhang, Tianyu
Zheng, Jia
Xu, Dongdong
Wang, Hong
Zheng, Rongzhang
Pareek, Satyaprakash
Tian, Lu
Xie, Dongliang
Luo, Hong
Shan, Yi
2022 32ND INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS, FPL, 2022, : 209 - 217
[2] Efficient Number Theoretic Transform accelerator on the versal platform powered by the AI Engine
Bao, Zhenshan
Zang, Tianhao
Liu, Yiqi
Zhang, Wenbo
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2025, 166
[3] High Performance Accelerator for CNN Applications
Kyriakos, Angelos
Kitsakis, Vasileios
Louropoulos, Alexandros
Papatheofanous, Elissaios-Alexios
Patronas, Ioannis
Reisis, Dionysios
2019 IEEE 29TH INTERNATIONAL SYMPOSIUM ON POWER AND TIMING MODELING, OPTIMIZATION AND SIMULATION (PATMOS 2019), 2019, : 135 - 140
[4] A-U3D: A Unified 2D/3D CNN Accelerator on the Versal Platform for Disparity Estimation
Zhang, Tianyu
Li, Dong
Wang, Hong
Li, Yunzhi
Ma, Xiang
Luo, Wei
Wang, Yu
Huang, Yang
Li, Yi
Zhang, Yu
Yang, Xinlin
Jia, Xijie
Lin, Qiang
Tian, Lu
Jiang, Fan
Xie, Dongliang
Luo, Hong
Shan, Yi
2022 32ND INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS, FPL, 2022, : 123 - 129
[5] SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator
Wu, Di
Fan, Xitian
Cao, Wei
Wang, Lingli
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2021, 29 (05) : 936 - 949
[6] AOS: An Automated Overclocking System for High-Performance CNN Accelerator Through Timing Delay Measurement on FPGA
Jiang, Weixiong
Yu, Heng
Chen, Fupeng
Ha, Yajun
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2023, 42 (09) : 2952 - 2965
[7] GenSeq+: A Scalable High-Performance Accelerator for Genome Sequencing
Wang, Chao
Gong, Lei
Lei, Shiming
Fang, Haijie
Li, Xi
Wang, Aili
Zhou, Xuehai
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2021, 18 (04) : 1512 - 1523
[8] Enabling FPGA and AI Engine Tasks in the HPX Programming Framework for Heterogeneous High-Performance Computing
Kalkhof, Torben
Heinz, Carsten
Koch, Andreas
APPLIED RECONFIGURABLE COMPUTING. ARCHITECTURES, TOOLS, AND APPLICATIONS, ARC 2024, 2024, 14553 : 75 - 89
[9] A High-performance CNN Processor Based on FPGA for MobileNets
Wu, Di
Zhang, Yu
Jia, Xijie
Tian, Lu
Li, Tianping
Sui, Lingzhi
Xie, Dongliang
Shan, Yi
2019 29TH INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 2019, : 136 - 143
[10] High-performance pipeline architecture for packet classification accelerator in DPU
Tan, Jing
Lv, GaoFeng
Ma, Yanni
Qiao, GuanJie
2021 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT), 2021, : 286 - 289
