4-bit Quantization of LSTM-based Speech Recognition Models

Cited by: 9
Authors
Fasoli, Andrea [1 ]
Chen, Chia-Yu [1 ]
Serrano, Mauricio [1 ]
Sun, Xiao [1 ]
Wang, Naigang [1 ]
Venkataramani, Swagath [1 ]
Saon, George [1 ]
Cui, Xiaodong [1 ]
Kingsbury, Brian [1 ]
Zhang, Wei [1 ]
Tuske, Zoltan [1 ]
Gopalakrishnan, Kailash [1 ]
Affiliations
[1] IBM Res, Armonk, NY 10504 USA
Source
INTERSPEECH 2021, 2021
Keywords
LSTM; HMM; RNN-T; quantization; INT4;
DOI
10.21437/Interspeech.2021-1962
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Code
100104; 100213;
Abstract
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naive quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve < 0.5% and < 1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits degradation in 4-bit inference to 1.3%.
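To make the "naive quantization" baseline concrete: a minimal sketch of a symmetric uniform INT4 quantizer, where the scale is initialized from the tensor's absolute maximum. This is an illustrative assumption, not the paper's customized per-layer scheme; the function names and the max-based scale initialization are hypothetical.

```python
import numpy as np

def quantize_int4(x, scale=None):
    """Symmetric uniform quantization to the signed 4-bit range [-8, 7].

    A naive baseline: if no scale is given, it is initialized from the
    tensor's absolute maximum (one common but crude choice). The paper
    instead tailors quantizers and initializations to each part of the
    network.
    """
    if scale is None:
        scale = np.max(np.abs(x)) / 7.0  # map the largest magnitude to level 7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT4 codes back to floating point for accuracy evaluation."""
    return q.astype(np.float32) * scale

# Example: a small weight vector round-tripped through INT4.
w = np.array([0.9, -0.35, 0.12, -0.7], dtype=np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
```

With only 16 levels available, the gap between `w` and `w_hat` (at most half a quantization step per element here) is what accumulates across LSTM layers and drives the WER degradation the abstract describes.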
Pages: 2586-2590
Page count: 5