XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Cited by: 2

Authors:

Jia, Xijie [1]
Zhang, Yu [1]
Liu, Guangdong [1]
Yang, Xinlin [1]
Zhang, Tianyu [1]
Zheng, Jia [1]
Xu, Dongdong [1]
Liu, Zhuohuan [1]
Liu, Mengke [1]
Yan, Xiaoyang [1]
Wang, Hong [1]
Zheng, Rongzhang [1]
Wang, Li [1]
Li, Dong [1]
Pareek, Satyaprakash [1]
Weng, Jian [1]
Tian, Lu [1]
Xie, Dongliang [1]
Luo, Hong [1]
Shan, Yi [2]

Affiliations:

[1] AMD, 15F Block B China Overseas Int Ctr,Bldg 5 5 Yard, Beijing 100029, Peoples R China

[2] PhiGent Robot, 25F,Tower B,Tsinghua Tongfang High Tech Plaza,1 W, Beijing 100083, Peoples R China

Source:

ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS | 2024 / Vol. 17 / Issue 02

Keywords:

ACAP; acceleration; AI Engine; ALU engine; CNN; FPGA; hardware heterogeneous architecture; Versal; IMAGE SUPERRESOLUTION

DOI:

10.1145/3617836

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends of higher accuracy and higher resolution generate larger networks. The requirements of computation or I/O are the key bottlenecks. In this article, we propose XVDPU: the AI Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques to improve data reuse and reduce I/O requirements. An arithmetic logic unit is further proposed that can better balance resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation can achieve 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation can further achieve 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary for high-definition CNN with the accelerator, achieving 3.8x FPS improvement on the residual channel attention network and 3.1x on super-efficient super-resolution. This accelerator can also solve the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.
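
The throughput figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the FPS and core counts stated above. The helper names (`speedup`, `fps_per_core`) are hypothetical, and the per-core and scaling ratios are derived here for illustration; they are not results reported in the abstract.

```python
# FPS figures for ResNet50 as quoted in the abstract.
FPS_ZCU102 = 168.5        # baseline design on ZCU102
FPS_96_CORES = 1653.0     # 96-AIE-core implementation on VCK190
FPS_256_CORES = 4050.0    # 256-AIE-core implementation


def speedup(new_fps: float, old_fps: float) -> float:
    """Ratio of two throughputs (hypothetical helper)."""
    return new_fps / old_fps


def fps_per_core(fps: float, cores: int) -> float:
    """Throughput normalized by AIE core count (derived metric, not from the source)."""
    return fps / cores


# Reproduces the "9.8x faster" claim from the two quoted FPS values.
speedup_vs_zcu102 = speedup(FPS_96_CORES, FPS_ZCU102)

# Derived: throughput gain when going from 96 to 256 AIE cores.
scaling_96_to_256 = speedup(FPS_256_CORES, FPS_96_CORES)

print(f"96-core vs ZCU102: {speedup_vs_zcu102:.1f}x")
print(f"256-core vs 96-core: {scaling_96_to_256:.2f}x for {256 / 96:.2f}x the cores")
print(f"FPS per core: {fps_per_core(FPS_96_CORES, 96):.2f} (96) vs "
      f"{fps_per_core(FPS_256_CORES, 256):.2f} (256)")
```

The first ratio matches the stated 9.8x. The other two ratios only restate the quoted numbers in normalized form and say nothing about why throughput scales the way it does.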


Pages: 24
