4-bit Quantization of LSTM-based Speech Recognition Models

Cited by: 9
Authors
Fasoli, Andrea [1 ]
Chen, Chia-Yu [1 ]
Serrano, Mauricio [1 ]
Sun, Xiao [1 ]
Wang, Naigang [1 ]
Venkataramani, Swagath [1 ]
Saon, George [1 ]
Cui, Xiaodong [1 ]
Kingsbury, Brian [1 ]
Zhang, Wei [1 ]
Tuske, Zoltan [1 ]
Gopalakrishnan, Kailash [1 ]
Affiliations
[1] IBM Res, Armonk, NY 10504 USA
Source
INTERSPEECH 2021, 2021
Keywords
LSTM; HMM; RNN-T; quantization; INT4;
DOI
10.21437/Interspeech.2021-1962
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Code
100104; 100213;
Abstract
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naive quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve < 0.5% and < 1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits degradation in 4-bit inference to 1.3%.
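To make the "naive quantization" baseline concrete: a minimal sketch of a symmetric uniform INT4 quantizer, where the scale is initialized from the tensor's absolute maximum. This is an illustrative assumption, not the paper's customized per-layer scheme; the function names and the max-based scale initialization are hypothetical.

```python
import numpy as np

def quantize_int4(x, scale=None):
    """Symmetric uniform quantization to the signed 4-bit range [-8, 7].

    A naive baseline: if no scale is given, it is initialized from the
    tensor's absolute maximum (one common but crude choice). The paper
    instead tailors quantizers and initializations to each part of the
    network.
    """
    if scale is None:
        scale = np.max(np.abs(x)) / 7.0  # map the largest magnitude to level 7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT4 codes back to floating point for accuracy evaluation."""
    return q.astype(np.float32) * scale

# Example: a small weight vector round-tripped through INT4.
w = np.array([0.9, -0.35, 0.12, -0.7], dtype=np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
```

With only 16 levels available, the gap between `w` and `w_hat` (at most half a quantization step per element here) is what accumulates across LSTM layers and drives the WER degradation the abstract describes.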
Pages: 2586-2590
Page count: 5