A Dynamic Execution Neural Network Processor for Fine-Grained Mixed-Precision Model Training Based on Online Quantization Sensitivity Analysis

Cited by: 1
Authors
Liu, Ruoyang [1 ]
Wei, Chenhan [1 ]
Yang, Yixiong [2 ]
Wang, Wenxun [1 ]
Yuan, Binbin [3 ]
Yang, Huazhong [1 ]
Liu, Yongpan [1 ]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[2] Nvidia, Shanghai 201210, Peoples R China
[3] Traff Control Technol Co Ltd, Beijing 100160, Peoples R China
Keywords
Training; Artificial neural networks; Quantization (signal); Process control; Tensors; System-on-chip; Memory management; Dynamic precision (DP); fully quantized network training; low-bit training; mixed-precision quantization; neural network (NN) training accelerator
DOI
10.1109/JSSC.2024.3377292
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification codes
0808; 0809
Abstract
As neural network (NN) training cost has been growing exponentially over the past decade, developing high-speed and energy-efficient training methods has become an urgent task. Fine-grained mixed-precision low-bit training is the most promising route to high-efficiency training, but it requires dedicated processor designs to overcome the overhead in control, storage, and I/O and to remove the power bottleneck in floating-point (FP) units. This article presents a dynamic execution NN processor supporting fine-grained mixed-precision training through online quantization sensitivity analysis. Three key features are proposed: a quantization-sensitivity-aware dynamic execution controller, a dynamic bit-width adaptive datapath design, and a low-power multi-level-aligned block-FP unit (BFPU). The chip achieves 13.2-TFLOPS/W energy efficiency and 1.07-TFLOPS/mm² area efficiency.
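The per-tensor bit-width selection that the abstract describes can be illustrated with a small sketch. The code below is not the paper's implementation: `block_fp_quantize`, the block size of 16, and the relative-MSE sensitivity metric in `quant_sensitivity` are all illustrative assumptions standing in for the chip's block-FP arithmetic and online quantization sensitivity analysis.

```python
import numpy as np

def block_fp_quantize(x, mant_bits=4, block_size=16):
    """Block floating-point (BFP) quantization sketch: each block of
    values shares one exponent; mantissas become low-bit integers."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(max_mag > 0, max_mag, 1.0)
    # Shared per-block exponent, chosen so the largest value fits the mantissa range.
    exp = np.where(max_mag > 0, np.floor(np.log2(safe)) + 1, 0.0)
    step = 2.0 ** exp / 2 ** (mant_bits - 1)           # one quantization step
    lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
    mant = np.clip(np.round(blocks / step), lo, hi)    # low-bit integer mantissas
    return (mant * step).reshape(-1)[: x.size]

def quant_sensitivity(x, mant_bits, block_size=16):
    """Relative quantization MSE -- a stand-in for an online sensitivity
    metric that drives per-tensor bit-width selection during training."""
    x = np.asarray(x, dtype=np.float64)
    err = x - block_fp_quantize(x, mant_bits, block_size)
    return float(np.mean(err ** 2) / (np.mean(x ** 2) + 1e-12))

def pick_bit_width(x, threshold=1e-3, candidates=(2, 4, 6, 8)):
    """Choose the narrowest mantissa width whose error stays below threshold."""
    for b in candidates:
        if quant_sensitivity(x, b) < threshold:
            return b
    return candidates[-1]
```

In this sketch, a training loop would call `pick_bit_width` per tensor each iteration, widening only the tensors whose measured sensitivity exceeds the threshold and keeping the rest at low precision.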
Pages: 3082-3093
Page count: 12
Related papers
37 records in total
[1]   A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling [J].
Agrawal, Ankur ;
Lee, Sae Kyu ;
Silberman, Joel ;
Ziegler, Matthew ;
Kang, Mingu ;
Venkataramani, Swagath ;
Cao, Nianzheng ;
Fleischer, Bruce ;
Guillorn, Michael ;
Cohen, Matthew ;
Mueller, Silvia ;
Oh, Jinwook ;
Lutz, Martin ;
Jung, Jinwook ;
Koswatta, Siyu ;
Zhou, Ching ;
Zalani, Vidhi ;
Bonanno, James ;
Casatuta, Robert ;
Chen, Chia-Yu ;
Choi, Jungwook ;
Haynie, Howard ;
Herbert, Alyssa ;
Jain, Radhika ;
Kar, Monodeep ;
Kim, Kyu-Hyoun ;
Li, Yulong ;
Ren, Zhibin ;
Rider, Scot ;
Schaal, Marcel ;
Schelm, Kerstin ;
Scheuermann, Michael ;
Sun, Xiao ;
Tran, Hung ;
Wang, Naigang ;
Wang, Wei ;
Zhang, Xin ;
Shah, Vinay ;
Curran, Brian ;
Srinivasan, Vijayalakshmi ;
Lu, Pong-Fei ;
Shukla, Sunil ;
Chang, Leland ;
Gopalakrishnan, Kailash .
2021 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), 2021, 64 :144-+
[2]  
Banner R, 2019, ADV NEUR IN, V32
[3]  
Brown TB, 2020, ADV NEUR IN, V33
[4]  
Chen Jianfei, 2021, P MACHINE LEARNING R, V139
[5]  
Cottier Ben, 2023, Epoch AI blog
[6]  
DDR3 SDRAM Standard, JEDEC Standard JESD79-3F, 2012
[7]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[8]  
Fu Zih-Sing, 2022, 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), P40, DOI 10.1109/VLSITechnologyandCir46769.2022.9830487
[9]  
Fujiwara H., 2022, IEEE INT SOLID STATE, V65, P1, DOI 10.1109/ISSCC42614.2022.9731754
[10]  
Guo JR, 2020, INT CONF ACOUST SPEE, P1603, DOI 10.1109/ICASSP40776.2020.9054164