PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor With Posit-Based Logarithm-Domain Computing

Cited by: 12
Authors
Wang, Yang [1 ,2 ]
Deng, Dazheng [1 ,2 ]
Liu, Leibo [1 ,2 ]
Wei, Shaojun [1 ,2 ]
Yin, Shouyi [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Sch Integrated Circuits, Beijing Innovat Ctr Future Chip, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Sch Integrated Circuits, Beijing Natl Res Ctr Informat Sci & Technol, Beijing 100084, Peoples R China
Keywords
DNN training processor; edge-devices; reconfigurable dataflow; posit; logarithm-domain computing; ACCELERATOR;
DOI
10.1109/TCSI.2022.3184115
CLC Classification
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
Edge-device deep neural network (DNN) training is practical for improving model adaptivity on unfamiliar datasets while avoiding privacy disclosure and heavy communication cost. Nevertheless, beyond the feed-forward (FF) pass used in inference, DNN training also requires back-propagation (BP) and weight-gradient (WG) computation, introducing power-hungry floating-point arithmetic, hardware underutilization, and an energy bottleneck from excessive memory access. This paper proposes a DNN training processor named PL-NPU that addresses these challenges with three innovations. First, a posit-based logarithm-domain processing element (PE) adapts to varying training-data requirements with a low-bit-width format and reduces energy by transforming complicated arithmetic into simple logarithm-domain operations. Second, a reconfigurable inter-intra-channel-reuse dataflow dynamically adjusts the PE mapping with a regrouping omega network to improve operand reuse for higher hardware utilization. Third, a pointed-stake-shaped codec unit adaptively compresses small values into a variable-length data format and large values into a fixed-length 8b posit format, reducing memory access to break the training energy bottleneck. Simulated in 28-nm CMOS technology, the proposed PL-NPU achieves a maximum frequency of 1040 MHz with 343 mW power consumption and a 5.28 mm² area. The peak energy efficiency is 3.87 TFLOPS/W at 0.6 V and 60 MHz. Compared with the state-of-the-art training processor, PL-NPU reaches 3.75× higher energy efficiency and offers a 1.68× speedup when training ResNet18.
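The abstract's first innovation pairs a low-bit-width posit format with logarithm-domain arithmetic, where a multiplication reduces to an addition of logarithms. As an illustration only (the paper states a fixed-length 8b posit format but not its exact field split, so posit⟨8,1⟩ and the software decoder below are assumptions, not the PL-NPU hardware), the idea can be sketched as:

```python
import math

def decode_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit posit with es exponent bits.

    posit<8,1> is an illustrative assumption; the paper only mentions
    an 8b posit format, not its regime/exponent/fraction split.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")              # NaR ("not a real")
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask            # negatives use two's complement
    # Regime: run of identical bits following the sign bit.
    rest = (bits << 1) & mask            # drop the sign bit
    first = rest >> (n - 1)
    run = 0
    while run < n - 1 and (rest >> (n - 1 - run)) & 1 == first:
        run += 1
    k = run - 1 if first else -run       # regime value
    rem = n - (1 + run + 1)              # bits after sign + regime + terminator
    tail = bits & ((1 << rem) - 1) if rem > 0 else 0
    e_bits = min(es, max(rem, 0))        # exponent field may be truncated
    e = (tail >> (rem - e_bits)) << (es - e_bits) if rem > 0 else 0
    f_bits = max(rem - es, 0)            # remaining bits form the fraction
    f = tail & ((1 << f_bits) - 1)
    frac = 1.0 + f / (1 << f_bits) if f_bits else 1.0
    value = 2.0 ** (k * (1 << es) + e) * frac
    return -value if sign else value

def log_mul(x: float, y: float) -> float:
    """Multiplication in the linear domain becomes addition in the log
    domain -- the arithmetic simplification a log-domain PE exploits."""
    return 2.0 ** (math.log2(x) + math.log2(y))   # x * y for positive x, y
```

For example, `decode_posit(0b01001000)` yields 1.5 (regime k=0, zero exponent bit, fraction 1000₂), and `log_mul(3.0, 5.0)` recovers 15.0 up to floating-point rounding; a hardware unit would quantize the log values rather than use full-precision `log2`.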
Pages: 4042-4055
Page count: 14
References
47 in total (first 10 listed)
[1]   A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling [J].
Agrawal, Ankur ;
Lee, Sae Kyu ;
Silberman, Joel ;
Ziegler, Matthew ;
Kang, Mingu ;
Venkataramani, Swagath ;
Cao, Nianzheng ;
Fleischer, Bruce ;
Guillorn, Michael ;
Cohen, Matthew ;
Mueller, Silvia ;
Oh, Jinwook ;
Lutz, Martin ;
Jung, Jinwook ;
Koswatta, Siyu ;
Zhou, Ching ;
Zalani, Vidhi ;
Bonanno, James ;
Casatuta, Robert ;
Chen, Chia-Yu ;
Choi, Jungwook ;
Haynie, Howard ;
Herbert, Alyssa ;
Jain, Radhika ;
Kar, Monodeep ;
Kim, Kyu-Hyoun ;
Li, Yulong ;
Ren, Zhibin ;
Rider, Scot ;
Schaal, Marcel ;
Schelm, Kerstin ;
Scheuermann, Michael ;
Sun, Xiao ;
Tran, Hung ;
Wang, Naigang ;
Wang, Wei ;
Zhang, Xin ;
Shah, Vinay ;
Curran, Brian ;
Srinivasan, Vijayalakshmi ;
Lu, Pong-Fei ;
Shukla, Sunil ;
Chang, Leland ;
Gopalakrishnan, Kailash .
2021 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), 2021, 64 :144-+
[2]  
Han S., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
[3]  
Cambier L., 2020, PROC INT C LEARN REP
[4]   Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices [J].
Chen, Yu-Hsin ;
Yang, Tien-Ju ;
Emer, Joel S. ;
Sze, Vivienne .
IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (02) :292-308
[5]   A Deep Neural Network Training Architecture With Inference-Aware Heterogeneous Data-Type [J].
Choi, Seungkyu ;
Shin, Jaekang ;
Kim, Lee-Sup .
IEEE TRANSACTIONS ON COMPUTERS, 2022, 71 (05) :1216-1229
[6]   TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training [J].
Choi, Seungkyu ;
Sim, Jaehyeong ;
Kang, Myeonggu ;
Kim, Lee-Sup .
PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON LOW POWER ELECTRONICS AND DESIGN (ISLPED '18), 2018, :104-109
[7]   An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices [J].
Choi, Seungkyu ;
Sim, Jaehyeong ;
Kang, Myeonggu ;
Choi, Yeongjae ;
Kim, Hyeonuk ;
Kim, Lee-Sup .
IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2020, 55 (10) :2691-2702
[8]  
Choquette J., 2020, 2020 IEEE HOT CHIPS, P1
[9]   Arithmetic on the European logarithmic microprocessor [J].
Coleman, JN ;
Chester, EI ;
Softley, CI ;
Kadlec, J .
IEEE TRANSACTIONS ON COMPUTERS, 2000, 49 (07) :702-715
[10]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPR.2009.5206848