Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training

Cited: 0
Authors
Kim, Jungwoo [1]
Na, Seonjin [1,2]
Lee, Sanghyeon [1]
Lee, Sunho [1]
Huh, Jaehyuk [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] Georgia Inst Technol, Atlanta, GA USA
Keywords
DNN training; accelerators; on-chip memory; scheduling;
DOI
10.1145/3613424.3614299
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
During training tasks for machine learning models with neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. For NPUs, the software-managed on-chip scratchpad memory (SPM) plays a crucial role in mitigating the long latency and limited bandwidth of off-chip DRAM accesses. Since the backward-pass computation must be optimized to make the SPM effective, this study identifies a new data reuse pattern specific to the backward computation. The backward pass includes independent input-gradient and weight-gradient computations that share the same output gradient in each layer. Conventional sequential processing does not exploit this potential inter-operation data reuse opportunity within the SPM. Building on this new data reuse opportunity in the backward pass, this study proposes a novel data flow transformation scheme called interleaved gradient order, consisting of three techniques to enhance the utilization of NPU scratchpad memory. The first technique shuffles the input- and weight-gradient computations by interleaving the two operations into a single fused operation to reduce redundant output-gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize the potential data locality. However, since the best order is not fixed for all tensors, we propose a selection algorithm that finds the most suitable order based on the tensor dimensions. The final technique further improves data reuse by using the best partitioning and mapping scheme for the two gradient computations on single-core and multi-core NPUs. A simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques improve performance by 29.3% and 14.5% for edge and server NPUs, respectively. Furthermore, with a quad-core server NPU, the proposed techniques reduce the execution time by 23.7%.
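As a concrete illustration of the reuse pattern described in the abstract, the hypothetical NumPy sketch below contrasts an interleaved gradient ordering with conventional sequential processing for a single fully connected layer. The function names, tile size, and loop structure are assumptions for illustration only, not the paper's NPU data flow or implementation.

```python
import numpy as np

# Minimal sketch of the reuse idea, assuming a fully connected layer with
# forward pass Y = X @ W, so the backward pass needs
#   dX = dY @ W.T   (input-gradient computation)
#   dW = X.T @ dY   (weight-gradient computation)
# Both operations read the same output gradient dY. Interleaving their tile
# loops lets each dY tile be fetched once and reused by both computations,
# mimicking reuse in a software-managed on-chip scratchpad.

def backward_interleaved(X, W, dY, tile=64):
    N, K = X.shape                       # batch x input features
    K2, M = W.shape                      # input features x output features
    assert K == K2 and dY.shape == (N, M)
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for n0 in range(0, N, tile):
        n1 = min(n0 + tile, N)
        dY_tile = dY[n0:n1]              # one fetch of this output-gradient tile
        dX[n0:n1] = dY_tile @ W.T        # input gradient reuses dY_tile
        dW += X[n0:n1].T @ dY_tile       # weight gradient reuses the same tile
    return dX, dW

def backward_sequential(X, W, dY, tile=64):
    # Conventional ordering: finish all input-gradient tiles first, then all
    # weight-gradient tiles, so every dY tile is fetched twice.
    N = X.shape[0]
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for n0 in range(0, N, tile):
        dX[n0:n0 + tile] = dY[n0:n0 + tile] @ W.T
    for n0 in range(0, N, tile):
        dW += X[n0:n0 + tile].T @ dY[n0:n0 + tile]
    return dX, dW

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, W = rng.standard_normal((256, 128)), rng.standard_normal((128, 64))
    dY = rng.standard_normal((256, 64))
    for a, b in zip(backward_interleaved(X, W, dY), backward_sequential(X, W, dY)):
        assert np.allclose(a, b)         # same results; only the access order differs
```

Both routines compute identical gradients; the interleaved version simply halves the number of output-gradient tile fetches, which is the locality benefit the SPM-level transformation targets.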
Pages: 438-451
Number of pages: 14