Binary neural networks for speech recognition

Cited by: 0
Authors
Yan-min Qian
Xu Xiang
Affiliations
[1] Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
[2] Shanghai Jiao Tong University, SpeechLab, Department of Computer Science and Engineering
Source
Frontiers of Information Technology & Electronic Engineering | 2019 / Volume 20
Keywords
Speech recognition; Binary neural networks; Binary matrix multiplication; Knowledge distillation; Population count
DOI: not available
CLC number: TP391.4
Abstract
Deep neural networks (DNNs) have recently come to significantly outperform Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load during the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structured sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost during the inference stage. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, a 5–7 times speedup over full-precision floating-point matrix multiplication can be achieved in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for large-vocabulary continuous speech recognition acoustic modeling. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks can deliver a 3–4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs or binary convolutional neural networks (CNNs) can restrict the word error rate (WER) degradation to within 15.0%, compared to the normal full-precision floating-point DNNs or CNNs, respectively. In particular, for the binary CNN with binarization only on the convolutional layers, the WER degradation is very small and almost negligible with the proposed approach.
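The speedup claimed above rests on a standard identity: if weights and activations are constrained to {−1, +1} and packed one element per bit, an n-element dot product reduces to one bitwise XOR (XNOR up to a sign convention) and one population count, since matching bits contribute +1 and mismatching bits −1, giving dot = n − 2·popcount(a XOR b). The NumPy sketch below illustrates only this identity; the function names are illustrative, and the per-byte popcount stands in for the hardware popcount instructions that the paper's optimized CPU/GPU kernels would presumably use.

```python
import numpy as np

def pack_signs(x):
    """Pack a {-1, +1} vector into bits: bit = 1 where the element is +1."""
    return np.packbits(x > 0)

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Matching bits contribute +1 and mismatching bits -1, so the dot
    product equals n - 2 * popcount(a XOR b).
    """
    mismatches = np.bitwise_xor(a_bits, b_bits)
    popcount = int(np.unpackbits(mismatches).sum())  # stand-in for a hardware popcount
    return n - 2 * popcount

# Sanity check against ordinary floating-point arithmetic.
rng = np.random.default_rng(0)
n = 64
a = np.sign(rng.standard_normal(n))
b = np.sign(rng.standard_normal(n))
assert binary_dot(pack_signs(a), pack_signs(b), n) == int(a @ b)
```

Packing 64 signed values into a single machine word is what turns 64 multiply-accumulates into one XOR and one popcount, which is where a speedup of the reported magnitude can come from.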
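The knowledge distillation mentioned in the abstract is, in its common form, training the compressed binary student on the soft posteriors of the full-precision teacher. Below is a minimal sketch of that generic objective; the temperature value, output dimension, and function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, computed stably."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the teacher's tempered posteriors against the
    student's; T=2.0 is an illustrative choice, not a value from the paper."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(-(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean())

# Example: a batch of 8 frames with a hypothetical 9000-class senone output.
rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((8, 9000))
student_logits = rng.standard_normal((8, 9000))
print(distillation_loss(student_logits, teacher_logits))
```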
Pages: 701–715 (14 pages)