Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Cited by: 53
Authors
Jiang, Weiwen [1 ,2 ,3 ]
Sha, Edwin H.-M. [1]
Zhang, Xinyi [2 ]
Yang, Lei [2 ]
Zhuge, Qingfeng [1 ]
Shi, Yiyu [3 ]
Hu, Jingtong [2 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Univ Pittsburgh, Pittsburgh, PA 15260 USA
[3] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
National Natural Science Foundation of China; U.S. National Science Foundation;
Keywords
FPGA; DNN inference; real-time; parallel computing;
DOI
10.1145/3358192
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless cars). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with small batch sizes, FPGAs are expected to deliver further performance gains. However, the performance achievable with a single-FPGA design is constrained by limited on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over the single-FPGA design. In implementing such systems, we encountered two barriers that hinder this goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which supports different kinds of DNNs. In this paper, we use Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. We then develop a novel design methodology that effectively alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48× speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, system latency can be further reduced while maintaining high energy efficiency.
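To make the partitioning idea concrete, below is a minimal back-of-the-envelope latency sketch written for this record as an illustration; it is not the paper's actual Super-LIP model, and every constant (layer shape, per-FPGA throughput, 19.2 GB/s DDR bandwidth, 10 GB/s inter-FPGA links, 3 MB on-chip buffer) is an assumption. It splits one convolutional layer's output channels across k FPGAs: once each FPGA's weight slice fits on chip, per-frame off-chip traffic collapses, which is how the speedup can exceed k.

# Illustrative back-of-the-envelope model (NOT the paper's actual
# Super-LIP formulation): latency of one CONV layer partitioned
# across k FPGAs by output channels. All constants are assumptions.

def conv_layer_latency(k, M=512, N=512, R=14, C=14, K=3,
                       peak_macs=2e12,    # MACs/s per FPGA (assumed)
                       mem_bw=19.2e9,     # DDR bandwidth, bytes/s (assumed)
                       link_bw=10e9,      # inter-FPGA link, bytes/s (assumed)
                       bram_bytes=3e6):   # usable on-chip buffer (assumed)
    """Latency (s) of one CONV layer split over k FPGAs.

    M/N: output/input channels; R/C: output feature-map size; K: kernel.
    Each FPGA computes M/k output channels, so it needs only 1/k of the
    weights; the full input feature map still reaches every FPGA (from
    DDR and, for k > 1, over inter-FPGA links).
    """
    macs = M * N * R * C * K * K                 # total multiply-accumulates
    weight_bytes = M * N * K * K * 2             # 16-bit weights
    ifmap_bytes = N * (R + K - 1) * (C + K - 1) * 2

    # Weights stream from DDR each frame unless the slice fits on chip.
    w_slice = weight_bytes / k
    w_traffic = 0 if w_slice <= bram_bytes else w_slice

    compute = (macs / k) / peak_macs             # compute splits k ways
    ddr = (w_traffic + ifmap_bytes) / mem_bw     # per-FPGA off-chip traffic
    link = 0 if k == 1 else ifmap_bytes / link_bw
    return max(compute, ddr + link)              # overlap: slower side wins

base = conv_layer_latency(1)
for k in (1, 2, 4):
    t = conv_layer_latency(k)
    print(f"{k} FPGA(s): {t * 1e3:.3f} ms  speedup {base / t:.2f}x")

Under these assumed numbers, 2 FPGAs already beat 2×: the single-FPGA case is memory-bound (weights exceed the on-chip buffer and must be re-streamed), while each partitioned slice fits on chip and becomes compute-bound. That is the same mechanism the abstract describes, shifting traffic from the memory bus onto inter-FPGA links.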
Pages: 23