A Dynamic Execution Neural Network Processor for Fine-Grained Mixed-Precision Model Training Based on Online Quantization Sensitivity Analysis

Cited by: 1
Authors
Liu, Ruoyang [1 ]
Wei, Chenhan [1 ]
Yang, Yixiong [2 ]
Wang, Wenxun [1 ]
Yuan, Binbin [3 ]
Yang, Huazhong [1 ]
Liu, Yongpan [1 ]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[2] Nvidia, Shanghai 201210, Peoples R China
[3] Traff Control Technol Co Ltd, Beijing 100160, Peoples R China
Keywords
Training; Artificial neural networks; Quantization (signal); Process control; Tensors; System-on-chip; Memory management; Dynamic precision (DP); fully quantized network training; low-bit training; mixed-precision quantization; neural network (NN) training accelerator
DOI
10.1109/JSSC.2024.3377292
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification codes
0808; 0809
Abstract
As neural network (NN) training cost has been growing exponentially over the past decade, developing high-speed and energy-efficient training methods has become an urgent task. Fine-grained mixed-precision low-bit training is the most promising route to high-efficiency training, but it requires dedicated processor designs to overcome the overhead in control, storage, and I/O and to remove the power bottleneck in floating-point (FP) units. This article presents a dynamic execution NN processor supporting fine-grained mixed-precision training through online quantization sensitivity analysis. Three key features are proposed: a quantization-sensitivity-aware dynamic execution controller, a dynamic bit-width adaptive datapath design, and a low-power multi-level-aligned block-FP unit (BFPU). The chip achieves 13.2-TFLOPS/W energy efficiency and 1.07-TFLOPS/mm² area efficiency.
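The per-tensor bit-width selection that the abstract describes can be illustrated with a small sketch. The code below is not the paper's implementation: `block_fp_quantize`, the block size of 16, and the relative-MSE sensitivity metric in `quant_sensitivity` are all illustrative assumptions standing in for the chip's block-FP arithmetic and online quantization sensitivity analysis.

```python
import numpy as np

def block_fp_quantize(x, mant_bits=4, block_size=16):
    """Block floating-point (BFP) quantization sketch: each block of
    values shares one exponent; mantissas become low-bit integers."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(max_mag > 0, max_mag, 1.0)
    # Shared per-block exponent, chosen so the largest value fits the mantissa range.
    exp = np.where(max_mag > 0, np.floor(np.log2(safe)) + 1, 0.0)
    step = 2.0 ** exp / 2 ** (mant_bits - 1)           # one quantization step
    lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
    mant = np.clip(np.round(blocks / step), lo, hi)    # low-bit integer mantissas
    return (mant * step).reshape(-1)[: x.size]

def quant_sensitivity(x, mant_bits, block_size=16):
    """Relative quantization MSE -- a stand-in for an online sensitivity
    metric that drives per-tensor bit-width selection during training."""
    x = np.asarray(x, dtype=np.float64)
    err = x - block_fp_quantize(x, mant_bits, block_size)
    return float(np.mean(err ** 2) / (np.mean(x ** 2) + 1e-12))

def pick_bit_width(x, threshold=1e-3, candidates=(2, 4, 6, 8)):
    """Choose the narrowest mantissa width whose error stays below threshold."""
    for b in candidates:
        if quant_sensitivity(x, b) < threshold:
            return b
    return candidates[-1]
```

In this sketch, a training loop would call `pick_bit_width` per tensor each iteration, widening only the tensors whose measured sensitivity exceeds the threshold and keeping the rest at low precision.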
Pages: 3082-3093
Page count: 12
Related papers
37 records in total
[1]   A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling [J].
Agrawal, Ankur ;
Lee, Sae Kyu ;
Silberman, Joel ;
Ziegler, Matthew ;
Kang, Mingu ;
Venkataramani, Swagath ;
Cao, Nianzheng ;
Fleischer, Bruce ;
Guillorn, Michael ;
Cohen, Matthew ;
Mueller, Silvia ;
Oh, Jinwook ;
Lutz, Martin ;
Jung, Jinwook ;
Koswatta, Siyu ;
Zhou, Ching ;
Zalani, Vidhi ;
Bonanno, James ;
Casatuta, Robert ;
Chen, Chia-Yu ;
Choi, Jungwook ;
Haynie, Howard ;
Herbert, Alyssa ;
Jain, Radhika ;
Kar, Monodeep ;
Kim, Kyu-Hyoun ;
Li, Yulong ;
Ren, Zhibin ;
Rider, Scot ;
Schaal, Marcel ;
Schelm, Kerstin ;
Scheuermann, Michael ;
Sun, Xiao ;
Tran, Hung ;
Wang, Naigang ;
Wang, Wei ;
Zhang, Xin ;
Shah, Vinay ;
Curran, Brian ;
Srinivasan, Vijayalakshmi ;
Lu, Pong-Fei ;
Shukla, Sunil ;
Chang, Leland ;
Gopalakrishnan, Kailash .
2021 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), 2021, 64 :144-+
[2]  
Banner R, 2019, ADV NEUR IN, V32
[3]  
Brown TB, 2020, ADV NEUR IN, V33
[4]  
Chen Jianfei, 2021, P MACHINE LEARNING R, V139
[5]  
Cottier Ben, 2023, Epoch AI blog
[6]  
DDR3 SDRAM Standard, JEDEC Standard JESD79-3F, 2012
[7]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[8]  
Fu Zih-Sing, 2022, 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), P40, DOI 10.1109/VLSITechnologyandCir46769.2022.9830487
[9]  
Fujiwara H., 2022, IEEE INT SOLID STATE, V65, P1, DOI 10.1109/ISSCC42614.2022.9731754
[10]  
Guo JR, 2020, INT CONF ACOUST SPEE, P1603, DOI 10.1109/ICASSP40776.2020.9054164