Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training

Citations: 0
Authors
Kim, Jungwoo [1 ]
Na, Seonjin [1 ,2 ]
Lee, Sanghyeon [1 ]
Lee, Sunho [1 ]
Huh, Jaehyuk [1 ]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
[2] Georgia Institute of Technology, Atlanta, GA, USA
Keywords
DNN training; accelerators; on-chip memory; scheduling
DOI
10.1145/3613424.3614299
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
During training of machine learning models on neural processing units (NPUs), the backward pass is the most time-consuming part and incurs significant overheads due to off-chip memory accesses. To mitigate the long latency and limited bandwidth of off-chip DRAM accesses, NPUs rely on software-managed on-chip scratchpad memory (SPM). Since the backward-pass computation must be optimized to improve the effectiveness of SPM, this study identifies a new data reuse pattern specific to the backward computation. In each layer, the backward pass includes independent input-gradient and weight-gradient computations that share the same output gradient. Conventional sequential processing does not exploit this potential inter-operation data reuse within SPM. Building on this new data reuse opportunity in the backward pass, this study proposes a novel data flow transformation scheme called interleaved gradient order, consisting of three techniques to enhance the utilization of NPU scratchpad memory. The first technique shuffles the input- and weight-gradient computations by interleaving the two operations into a single fused operation, reducing redundant output-gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize the potential data locality; however, since the best order is not fixed for all tensors, we propose a selection algorithm that finds the most suitable order based on the tensor dimensions. The final technique further improves data reuse by using the best partitioning and mapping scheme for the two gradient computations on single-core and multi-core NPUs. A simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques improve performance by 29.3% and 14.5% for edge and server NPUs, respectively. Furthermore, on a quad-core server NPU, the proposed techniques reduce execution time by 23.7%.
Pages: 438-451
Page count: 14
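
Illustrative sketch of the first technique: the minimal NumPy example below (not the authors' NPU implementation; the fully connected layer, function name, and tile size are illustrative assumptions) shows how interleaving the input-gradient and weight-gradient computations lets each output-gradient (dY) tile be fetched once and reused for both partial results, instead of being streamed twice in two separate passes.

```python
import numpy as np

def interleaved_backward(X, W, dY, tile=64):
    """For a fully connected layer Y = X @ W, compute the input gradient
    dX = dY @ W.T and the weight gradient dW = X.T @ dY while reusing
    each dY tile for both computations (interleaved gradient order)."""
    N = X.shape[0]
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for n0 in range(0, N, tile):
        dY_tile = dY[n0:n0 + tile]   # fetch this output-gradient tile once
        X_tile = X[n0:n0 + tile]
        # The resident dY tile feeds BOTH gradient computations before it
        # is evicted, removing the redundant second pass over dY.
        dX[n0:n0 + tile] = dY_tile @ W.T   # input-gradient partial result
        dW += X_tile.T @ dY_tile           # weight-gradient accumulation
    return dX, dW

# Example shapes: batch 256, 128 inputs, 64 outputs.
X = np.random.randn(256, 128)
W = np.random.randn(128, 64)
dY = np.random.randn(256, 64)
dX, dW = interleaved_backward(X, W, dY)
```

On an NPU, the same idea applies to SPM-resident tiles; the tile access order selection and the partitioning across cores (the second and third techniques) are not captured by this sketch.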