Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators

Cited by: 7
Authors
Jeong, Hyuk-Jin [1 ]
Yeo, JiHwan [1 ]
Bahk, Cheongyo [1 ]
Park, JongHyun [1 ]
Affiliations
[1] Samsung Research, Seoul, South Korea
Source
Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2023), 2023
Keywords
neural networks; accelerator; compiler
DOI
10.1145/3579990.3580017
CLC number (Chinese Library Classification)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Growing interest in on-device AI has led to the proliferation of accelerators dedicated to neural network inference. Most ASIC accelerators are equipped with compiler-controlled scratchpad memory (SPM) used as a last-level cache to reduce the number of accesses to off-chip memory. A widely used strategy for utilizing SPM is fused-layer execution, which divides a DNN model into groups of layers and forwards the intermediate results within each group without eviction to off-chip memory. However, layer fusion has an inherent limitation: fusing consecutive layers increases the amount of computation, leading to sub-optimal performance. This paper introduces a new dimension to SPM usage, which temporarily pins a feature map in SPM. Pinning reduces off-chip transfer without increasing computation, but it is not applicable to all feature maps due to the limited SPM size. We find that superior performance can be achieved by combining pinning and fusion in MobileNet. Based on this observation, we propose a model-level optimization method that jointly applies pinning and fusion to minimize inference latency under memory constraints. Scheduling and allocation schemes are presented for the automatic generation of optimized code. Evaluation on a commercial AI accelerator shows that the proposed method reduces off-chip transfer of feature maps by 50% and improves inference latency by 15% on average, without additional hardware, compared to the state-of-the-art fusion approach.
Pages: 224-235
Page count: 12
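
The pin-vs-fuse trade-off described in the abstract above can be made concrete with a toy cost model. The sketch below is a minimal illustration, not the paper's algorithm: the paper jointly optimizes pinning and fusion at the model level, whereas this sketch decides greedily per producer-consumer edge, and `SPM_BYTES`, `DMA_COST_PER_BYTE`, and `halo_overhead` are all hypothetical values invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    out_bytes: int        # size of the layer's output feature map
    compute: float        # nominal compute cost of the layer
    halo_overhead: float  # extra recompute fraction if fused with its consumer

SPM_BYTES = 512 * 1024   # assumed scratchpad capacity (hypothetical)
DMA_COST_PER_BYTE = 1.0  # assumed relative cost of one off-chip byte (hypothetical)

def plan(layers):
    """For each producer -> consumer edge, pick the cheapest of:
    pin   - keep the feature map in SPM: no transfer, no extra compute,
            but only possible if the whole map fits;
    fuse  - stream tiles through SPM, paying recomputed overlap (halo) regions;
    spill - write the map to DRAM and read it back later."""
    decisions = []
    for prod, cons in zip(layers, layers[1:]):
        spill_cost = 2 * prod.out_bytes * DMA_COST_PER_BYTE  # write + read back
        fuse_cost = cons.compute * prod.halo_overhead        # overlapping-tile recompute
        if prod.out_bytes <= SPM_BYTES:
            decisions.append((prod.name, "pin", 0.0))
        elif fuse_cost < spill_cost:
            decisions.append((prod.name, "fuse", fuse_cost))
        else:
            decisions.append((prod.name, "spill", spill_cost))
    return decisions

if __name__ == "__main__":
    # Illustrative four-layer chain; sizes and costs are made up.
    net = [
        Layer("conv1", out_bytes=1_600_000, compute=90.0, halo_overhead=0.20),
        Layer("conv2", out_bytes=400_000,  compute=60.0, halo_overhead=0.10),
        Layer("conv3", out_bytes=100_000,  compute=30.0, halo_overhead=0.05),
        Layer("fc",    out_bytes=4_000,    compute=5.0,  halo_overhead=0.0),
    ]
    for name, choice, cost in plan(net):
        print(f"{name}: {choice} (extra cost {cost:.1f})")
```

The greedy per-edge rule only illustrates why pinning dominates whenever a feature map fits in SPM (zero extra cost) and why fusion's recompute overhead can still beat a DRAM round trip when it does not; the paper's scheduler instead makes these choices globally under SPM allocation constraints.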