Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Cited by: 53
Authors
Jiang, Weiwen [1 ,2 ,3 ]
Sha, Edwin H.-M. [1]
Zhang, Xinyi [2 ]
Yang, Lei [2 ]
Zhuge, Qingfeng [1 ]
Shi, Yiyu [3 ]
Hu, Jingtong [2 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Univ Pittsburgh, Pittsburgh, PA 15260 USA
[3] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
National Natural Science Foundation of China; U.S. National Science Foundation;
Keywords
FPGA; DNN inference; real-time; parallel computing;
DOI
10.1145/3358192
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless cars). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with small batch sizes, FPGAs are expected to deliver further performance gains. However, the performance achievable with a single-FPGA design is constrained by limited on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over the single-FPGA design. In implementing such systems, we encountered two barriers that hinder this goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which supports different kinds of DNNs. In this paper, we use Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. We then develop a novel design methodology that effectively alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48× speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, system latency can be further reduced while maintaining high energy efficiency.
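To make the partitioning idea concrete, below is a minimal back-of-the-envelope latency sketch written for this record as an illustration; it is not the paper's actual Super-LIP model, and every constant (layer shape, per-FPGA throughput, 19.2 GB/s DDR bandwidth, 10 GB/s inter-FPGA links, 3 MB on-chip buffer) is an assumption. It splits one convolutional layer's output channels across k FPGAs: once each FPGA's weight slice fits on chip, per-frame off-chip traffic collapses, which is how the speedup can exceed k.

# Illustrative back-of-the-envelope model (NOT the paper's actual
# Super-LIP formulation): latency of one CONV layer partitioned
# across k FPGAs by output channels. All constants are assumptions.

def conv_layer_latency(k, M=512, N=512, R=14, C=14, K=3,
                       peak_macs=2e12,    # MACs/s per FPGA (assumed)
                       mem_bw=19.2e9,     # DDR bandwidth, bytes/s (assumed)
                       link_bw=10e9,      # inter-FPGA link, bytes/s (assumed)
                       bram_bytes=3e6):   # usable on-chip buffer (assumed)
    """Latency (s) of one CONV layer split over k FPGAs.

    M/N: output/input channels; R/C: output feature-map size; K: kernel.
    Each FPGA computes M/k output channels, so it needs only 1/k of the
    weights; the full input feature map still reaches every FPGA (from
    DDR and, for k > 1, over inter-FPGA links).
    """
    macs = M * N * R * C * K * K                 # total multiply-accumulates
    weight_bytes = M * N * K * K * 2             # 16-bit weights
    ifmap_bytes = N * (R + K - 1) * (C + K - 1) * 2

    # Weights stream from DDR each frame unless the slice fits on chip.
    w_slice = weight_bytes / k
    w_traffic = 0 if w_slice <= bram_bytes else w_slice

    compute = (macs / k) / peak_macs             # compute splits k ways
    ddr = (w_traffic + ifmap_bytes) / mem_bw     # per-FPGA off-chip traffic
    link = 0 if k == 1 else ifmap_bytes / link_bw
    return max(compute, ddr + link)              # overlap: slower side wins

base = conv_layer_latency(1)
for k in (1, 2, 4):
    t = conv_layer_latency(k)
    print(f"{k} FPGA(s): {t * 1e3:.3f} ms  speedup {base / t:.2f}x")

Under these assumed numbers, 2 FPGAs already beat 2×: the single-FPGA case is memory-bound (weights exceed the on-chip buffer and must be re-streamed), while each partitioned slice fits on chip and becomes compute-bound. That is the same mechanism the abstract describes, shifting traffic from the memory bus onto inter-FPGA links.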
Pages: 23