XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2
Authors
Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]
Affiliations
[1] AMD, 15F Block B China Overseas Int Ctr,Bldg 5 5 Yard, Beijing 100029, Peoples R China
[2] PhiGent Robot, 25F,Tower B,Tsinghua Tongfang High Tech Plaza,1 W, Beijing 100083, Peoples R China
Keywords
ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; IMAGE SUPERRESOLUTION
DOI
10.1145/3617836
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the push toward higher accuracy and higher resolution produces larger networks, and the resulting computation and I/O requirements become the key bottlenecks. In this article, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips, to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O requirements. We further propose an arithmetic logic unit that better balances resource utilization, new feature support, and the efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary processing for high-definition CNNs with the accelerator, yielding a 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. The accelerator can also handle the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.
Pages: 24