A Small-Footprint Accelerator for Large-Scale Neural Networks

Cited by: 3
Authors
Chen, Tianshi [1 ]
Zhang, Shijin [1 ]
Liu, Shaoli [1 ]
Du, Zidong [1 ]
Luo, Tao [1 ]
Gao, Yuan [2 ]
Liu, Junjie [2 ]
Wang, Dongsheng [2 ]
Wu, Chengyong [1 ]
Sun, Ninghui [1 ]
Chen, Yunji [1 ,4 ]
Temam, Olivier [3 ]
Affiliations
[1] Chinese Acad Sci, ICT, SKLCA, Beijing 100190, Peoples R China
[2] Tsinghua Univ, TNLIST, Beijing 100084, Peoples R China
[3] Inria, Saclay, France
[4] Chinese Acad Sci, Ctr Excellence Brain Sci, Beijing 100190, Peoples R China
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 2015, Vol. 33, No. 2
Keywords
Architecture; Processor; Hardware; Recognition
DOI
10.1145/2701417
Chinese Library Classification
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with high throughput, capable of performing 452 GOP/s (counting key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster and reduces total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
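The 452 GOP/s figure counts the two primitive operations named in the abstract: synaptic weight multiplications and neuron output additions. A minimal C sketch of one fully-connected layer shows where those operations arise; the function name fc_layer, the layer shape parameters, and the row-major weight layout are illustrative assumptions, not a description of the accelerator's actual datapath.

#include <stddef.h>

/* Illustrative sketch of one fully-connected layer, counting the
 * operations the abstract's GOP/s figure refers to: each output
 * neuron requires ni synaptic weight multiplications and ni output
 * additions, i.e. 2 * ni * no operations per layer.
 * Weights w are assumed row-major: w[o * ni + i] is the synapse
 * from input i to output neuron o. */
void fc_layer(const float *in, const float *w, float *out,
              size_t ni, size_t no)
{
    for (size_t o = 0; o < no; o++) {
        float sum = 0.0f;                  /* neuron output accumulator */
        for (size_t i = 0; i < ni; i++) {
            /* one synaptic weight multiplication + one addition */
            sum += w[o * ni + i] * in[i];
        }
        out[o] = sum;
    }
}

Under this count, a hypothetical layer with ni = no = 1024 involves roughly 2.1 million operations, which at 452 GOP/s would take on the order of 5 microseconds; dividing the reported throughput by the reported 485 mW gives an energy efficiency of roughly 930 GOP/s per watt.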
Pages: 27