SynergyFlow: An Elastic Accelerator Architecture Supporting Batch Processing of Large-Scale Deep Neural Networks

Cited by: 1
Authors
Li, Jiajun [1 ,2 ]
Yan, Guihai [1 ,2 ]
Lu, Wenyan [1 ,2 ]
Gong, Shijun [1 ,2 ]
Jiang, Shuhao [1 ,2 ]
Wu, Jingya [1 ,2 ]
Li, Xiaowei [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, 6 Kexueyuan South Rd, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep neural networks; convolutional neural networks; accelerator; architecture; resource utilization; complementary effect; COPROCESSOR; HARDWARE;
DOI
10.1145/3275243
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Neural networks (NNs) have achieved great success in a broad range of applications. Because NN-based methods are often both computation and memory intensive, accelerator solutions have proved highly promising in terms of both performance and energy efficiency. Although prior solutions can deliver high computational throughput for convolutional layers, they can incur severe performance degradation when accommodating the entire network model, because computing and memory bandwidth requirements differ widely between convolutional layers and fully connected layers and, furthermore, among different NN models. To overcome this problem, we propose an elastic accelerator architecture, called SynergyFlow, which intrinsically supports layer-level and model-level parallelism for large-scale deep neural networks. SynergyFlow boosts resource utilization by exploiting the complementary resource demands of different layers and different NN models. SynergyFlow can dynamically reconfigure itself according to workload characteristics, maintaining high performance and high resource utilization across various models. As a case study, we implement SynergyFlow on a P395-AB FPGA board. At a 100MHz working frequency, our implementation improves performance by 33.8% on average (up to 67.2% on AlexNet) compared to comparably provisioned previous architectures.
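The complementary-demand idea in the abstract can be illustrated with a small back-of-the-envelope model (this is an illustration only, not the paper's actual design or measurements): compute-bound convolutional layers of one input can be overlapped with memory-bound fully connected layers of the previous input, keeping both compute units and memory bandwidth busy. The cycle counts below are hypothetical placeholders.

```python
# Illustrative pipelining model for layer-level parallelism.
# CONV_CYCLES and FC_CYCLES are made-up stage latencies, not AlexNet numbers.
CONV_CYCLES = 90   # hypothetical time for all CONV layers of one input
FC_CYCLES = 40     # hypothetical time for all FC layers of one input
BATCH = 8          # number of inputs processed

# Sequential execution: each input runs its CONV then FC layers back to back,
# so either the compute units or the memory bandwidth sits idle at any time.
sequential = BATCH * (CONV_CYCLES + FC_CYCLES)

# Pipelined execution: FC of input i overlaps CONV of input i+1.
# After the first CONV fill, the steady-state step is the slower stage;
# the final FC drains the pipeline.
pipelined = CONV_CYCLES + (BATCH - 1) * max(CONV_CYCLES, FC_CYCLES) + FC_CYCLES

speedup = sequential / pipelined
print(f"sequential: {sequential} cycles, pipelined: {pipelined} cycles, "
      f"speedup: {speedup:.2f}x")
```

With these placeholder numbers the overlap yields roughly a 1.37x speedup; the attainable gain depends on how closely the two stages' latencies are balanced, which is the resource-provisioning question the architecture addresses.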
Pages: 27