Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

Cited by: 148
Authors
Yang, Xuan [1 ]
Gao, Mingyu [2 ]
Liu, Qiaoyi [1 ]
Setter, Jeff [1 ]
Pu, Jing [1 ]
Nayak, Ankita [1 ]
Bell, Steven [1 ]
Cao, Kaidi [1 ]
Ha, Heonjae [1 ]
Raina, Priyanka [1 ]
Kozyrakis, Christos [1 ,3 ]
Horowitz, Mark [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Tsinghua Univ, Beijing, Peoples R China
[3] Google, Mountain View, CA 94043 USA
Source
TWENTY-FIFTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXV) | 2020
Keywords
neural networks; dataflow; domain specific language;
DOI
10.1145/3373376.3378514
CLC Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2x energy improvement for Convolutional Neural Networks (CNNs), 1.6x and 1.8x improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
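To make the abstract's framing concrete, the sketch below (C++ using the Halide API) writes one convolutional layer as the seven nested loops the abstract refers to and then applies a tile/reorder/unroll schedule of the kind used to describe accelerator dataflows. This is a minimal illustration, not code from the paper's artifact: the layer sizes, variable names, and blocking factors are assumptions chosen for brevity.

// Minimal sketch (illustrative, not from the paper): a conv layer as seven
// nested loops in Halide, plus one possible blocked/parallel schedule.
#include "Halide.h"
using namespace Halide;

int main() {
    // Activations: (x, y, input channel, batch); weights: (r.x, r.y, input channel, output channel).
    ImageParam act(Float(32), 4, "act");
    ImageParam wgt(Float(32), 4, "wgt");

    Var x("x"), y("y"), k("k"), n("n");      // output width, output height, output channel, batch
    RDom r(0, 3, 0, 3, 0, 64, "r");          // kernel width, kernel height, input channels (assumed sizes)

    Func conv("conv");
    conv(x, y, k, n) = 0.0f;
    conv(x, y, k, n) += act(x + r.x, y + r.y, r.z, n) * wgt(r.x, r.y, r.z, k);
    // x, y, k, n plus the three reduction dimensions give the seven nested loops.

    // One choice of dataflow: tile the output plane (loop blocking for an on-chip
    // buffer) and unroll the inner tile (spatial mapping onto processing elements).
    Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
    conv.update()
        .tile(x, y, xo, yo, xi, yi, 4, 4)
        .reorder(xi, yi, r.x, r.y, r.z, xo, yo, k, n)
        .unroll(xi)
        .unroll(yi);

    conv.print_loop_nest();   // inspect the resulting loop order
    return 0;
}

In this framing, a different accelerator dataflow corresponds to a different choice of tiling factors, loop order, and which loops are unrolled or parallelized, which is what lets the scheduling language describe the design space the abstract discusses.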
Pages: 369-383
Number of pages: 15