XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2
Authors
Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]
Affiliations
[1] AMD, 15F Block B China Overseas Int Ctr,Bldg 5 5 Yard, Beijing 100029, Peoples R China
[2] PhiGent Robot, 25F,Tower B,Tsinghua Tongfang High Tech Plaza,1 W, Beijing 100083, Peoples R China
Keywords
ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; image super-resolution
DOI
10.1145/3617836
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Convolutional neural networks (CNNs) are now widely used in computer vision applications. However, the trend toward higher accuracy and higher resolution produces larger networks, making computation and I/O requirements the key bottlenecks. In this article, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips, to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O requirements. We further propose an arithmetic logic unit (ALU) that better balances resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary operation for high-definition CNNs on the accelerator, yielding a 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. The accelerator can also handle the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.
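The throughput figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the FPS values and core counts are taken from the abstract, while the constant names, the `speedup` helper, and the derived per-core figures are assumptions introduced here and are not reported by the authors.

```python
# ResNet50 throughput figures quoted in the abstract (frames per second).
FPS_VCK190_96_AIE = 1653.0    # 96-AIE-core implementation on VCK190
FPS_VCK190_256_AIE = 4050.0   # 256-AIE-core implementation
FPS_ZCU102 = 168.5            # baseline design on ZCU102


def speedup(new_fps: float, baseline_fps: float) -> float:
    """Return how many times higher new_fps is than baseline_fps."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return new_fps / baseline_fps


# Ratio behind the "9.8x faster" claim.
speedup_vs_zcu102 = speedup(FPS_VCK190_96_AIE, FPS_ZCU102)

# Derived here, not stated in the abstract: gain from 96 to 256 cores,
# and throughput per AIE core for each configuration.
scaling_256_vs_96 = speedup(FPS_VCK190_256_AIE, FPS_VCK190_96_AIE)
fps_per_core_96 = FPS_VCK190_96_AIE / 96
fps_per_core_256 = FPS_VCK190_256_AIE / 256

print(f"96-AIE vs ZCU102:    {speedup_vs_zcu102:.2f}x")
print(f"256-AIE vs 96-AIE:   {scaling_256_vs_96:.2f}x")
print(f"FPS per core (96):   {fps_per_core_96:.2f}")
print(f"FPS per core (256):  {fps_per_core_256:.2f}")
```

Under these numbers the 256-core configuration delivers somewhat less throughput per core than the 96-core one, which would be consistent with sub-linear scaling. The abstract itself does not discuss this, so treat it as a derived observation rather than a reported result.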
Pages: 24