Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators

Cited by: 7
Authors
Jeong, Hyuk-Jin [1 ]
Yeo, JiHwan [1 ]
Bahk, Cheongyo [1 ]
Park, JongHyun [1 ]
Affiliations
[1] Samsung Res, Seoul, South Korea
Source
PROCEEDINGS OF THE 21ST ACM/IEEE INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO 2023 | 2023
Keywords
neural networks; accelerator; compiler
DOI
10.1145/3579990.3580017
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Growing interest in on-device AI has led to the proliferation of accelerators dedicated to neural network inference. Most ASIC accelerators are equipped with compiler-controlled scratchpad memory (SPM) used as a last-level cache to reduce the number of accesses to off-chip memory. A widely used strategy for utilizing SPM is fused-layer execution, which divides a DNN model into groups of layers and forwards the intermediate results within each group without evicting them to off-chip memory. However, layer fusion has an inherent limitation: fusing consecutive layers increases the amount of computation, leading to sub-optimal performance. This paper introduces a new dimension of SPM usage, which temporarily pins a feature map in SPM. Pinning reduces off-chip transfer without increasing computation, but it is not applicable to all feature maps due to the limited SPM size. We find that superior performance can be achieved by combining pinning and fusion in MobileNet. Based on this observation, we propose a model-level optimization method that jointly applies pinning and fusion to minimize inference latency under memory constraints. Scheduling and allocation schemes are presented for the automatic generation of optimized code. Evaluation on a commercial AI accelerator shows that the proposed method reduces off-chip transfer of feature maps by 50% and improves inference latency by 15% on average without additional hardware, compared to the state-of-the-art fusion approach.
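The pin-or-fuse trade-off described in the abstract can be sketched with a toy cost model. The Python sketch below is purely illustrative and is not the paper's scheduling or allocation algorithm; the SPM capacity, DRAM cost per KB, fusion recomputation overhead, and the example feature maps are all invented assumptions. It brute-forces, for each intermediate feature map, a choice among spilling to DRAM, fusing into the consumer layer (extra computation, no DRAM traffic), and pinning in SPM (no extra computation, but SPM capacity is occupied), then keeps the cheapest combination that fits on-chip.

# Toy illustration (not the paper's actual algorithm) of the pin-or-fuse decision.
# All names, sizes, and cost factors are made-up assumptions for this sketch.
from dataclasses import dataclass
from itertools import product

@dataclass
class FeatureMap:
    name: str
    size_kb: int          # size of the feature map (KB); also its DRAM traffic per transfer
    producer_cost: float  # compute cost of the layer that produces it (ms)

SPM_KB = 512              # assumed scratchpad capacity
DRAM_MS_PER_KB = 0.002    # assumed DRAM transfer cost per KB
FUSE_OVERHEAD = 0.25      # assumed fraction of producer compute re-done when fused (halo recomputation)

SPILL, FUSE, PIN = "spill", "fuse", "pin"

def plan(fmaps):
    """Return (choice_per_map, extra_latency_ms) found by exhaustive search."""
    best = None
    for choices in product((SPILL, FUSE, PIN), repeat=len(fmaps)):
        pinned_kb = sum(f.size_kb for f, c in zip(fmaps, choices) if c == PIN)
        # Fused intermediates live in SPM only transiently, alongside everything pinned.
        transient_kb = max((f.size_kb for f, c in zip(fmaps, choices) if c == FUSE), default=0)
        if pinned_kb + transient_kb > SPM_KB:
            continue  # this combination does not fit on-chip
        latency = 0.0
        for f, c in zip(fmaps, choices):
            if c == SPILL:
                latency += 2 * f.size_kb * DRAM_MS_PER_KB  # write to DRAM, then read back
            elif c == FUSE:
                latency += FUSE_OVERHEAD * f.producer_cost  # recomputation overhead of fusion
            # PIN adds neither DRAM traffic nor compute in this toy model
        if best is None or latency < best[1]:
            best = (choices, latency)
    return best

if __name__ == "__main__":
    model = [
        FeatureMap("conv1_out", size_kb=392, producer_cost=0.8),
        FeatureMap("conv2_out", size_kb=196, producer_cost=1.2),
        FeatureMap("conv3_out", size_kb=96,  producer_cost=0.6),
    ]
    choices, extra = plan(model)
    for f, c in zip(model, choices):
        print(f"{f.name}: {c}")
    print(f"estimated extra latency: {extra:.3f} ms")

In this simplified model, pinning is always preferred when capacity allows, fusion is chosen when its recomputation overhead is cheaper than the spill traffic it avoids, and spilling is the fallback; the paper's contribution is making this joint decision across a whole model under real SPM constraints.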
Pages: 224 - 235
Number of pages: 12