XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2
Authors
Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]
Affiliations
[1] AMD, 15F Block B China Overseas Int Ctr,Bldg 5 5 Yard, Beijing 100029, Peoples R China
[2] PhiGent Robot, 25F,Tower B,Tsinghua Tongfang High Tech Plaza,1 W, Beijing 100083, Peoples R China
Keywords
ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; IMAGE SUPERRESOLUTION
DOI
10.1145/3617836
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trend toward higher accuracy and higher resolution produces larger networks, and computation and I/O requirements become the key bottlenecks. In this article, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips, to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O requirements. We further propose an arithmetic logic unit that better balances resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary operation for high-definition CNNs with the accelerator, achieving a 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. The accelerator can also handle the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.
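The throughput comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the ResNet50 FPS figures quoted above; the helper name `speedup` and the two derived ratios (256-core versus ZCU102, and 256-core versus 96-core) are illustrative assumptions and are not stated in the abstract.

```python
# Throughput figures quoted in the abstract (frames per second, ResNet50).
FPS_VCK190_96_AIE = 1653.0
FPS_VCK190_256_AIE = 4050.0
FPS_ZCU102 = 168.5


def speedup(new_fps: float, baseline_fps: float) -> float:
    """Return how many times faster new_fps is than baseline_fps."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return new_fps / baseline_fps


# Ratio reported in the abstract as 9.8x.
speedup_96 = speedup(FPS_VCK190_96_AIE, FPS_ZCU102)

# Derived here for illustration only; the abstract does not report these.
speedup_256 = speedup(FPS_VCK190_256_AIE, FPS_ZCU102)
scaling_256_vs_96 = speedup(FPS_VCK190_256_AIE, FPS_VCK190_96_AIE)

print(f"96 AIE cores vs ZCU102: {speedup_96:.1f}x")
print(f"256 AIE cores vs ZCU102: {speedup_256:.1f}x")
print(f"256 vs 96 AIE cores: {scaling_256_vs_96:.2f}x")
```

The first ratio reproduces the 9.8x figure from the abstract; the other two only restate the published FPS numbers as ratios and say nothing about how the designs differ internally.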
Pages: 24