DNN-LSTM based VAD algorithm

Cited by: 0
Authors
Zhang X. [1]
Niu P. [1]
Gao F. [1]
Affiliations
[1] College of Information Engineering, Taiyuan University of Technology, Taiyuan
Source
Qinghua Daxue Xuebao/Journal of Tsinghua University | 2018 / Vol. 58 / No. 05
Keywords
Deep neural network; Long short-term memory; Voice activity detection
DOI
10.16511/j.cnki.qhdxxb.2018.25.022
Abstract
Voice activity detection (VAD) algorithms based on deep neural networks (DNN) ignore the temporal correlation of acoustic features between speech frames, which significantly reduces their performance in noisy environments. This paper presents a hybrid deep neural network with long short-term memory (LSTM) for VAD that exploits dynamic information across speech frames. A cost function based on context information is used to train the DNN-LSTM network. The noisy speech corpus was built from TIDIGITS and Noisex-92. The results show that the DNN-LSTM based VAD algorithm achieves better recognition accuracy than DNN-based VAD algorithms in noisy environments, which indicates that this cost function is more suitable than the traditional cost function. © 2018, Tsinghua University Press. All rights reserved.
Pages: 509-515
Number of pages: 6
Related papers
18 in total
  • [1] Benyassine A., Shlomot E., Su H.Y., et al., A robust low complexity voice activity detection algorithm for speech communication systems, Speech Coding for Telecommunications Proceeding. Pocono Manor, USA: IEEE, pp. 97-98, (1997)
  • [2] Cho N., Kim E.K., Enhanced voice activity detection using acoustic event detection and classification, IEEE Transactions on Consumer Electronics, 57, 1, pp. 196-202, (2011)
  • [3] Chang J.H., Kim N.S., Voice activity detection based on complex Laplacian model, Electronics Letters, 39, 7, pp. 632-634, (2003)
  • [4] Ramirez J., Yelamos P., Gorriz J.M., et al., SVM-based speech endpoint detection using contextual speech features, Electronics Letters, 42, 7, pp. 426-428, (2006)
  • [5] Zhang X.L., Wu J., Deep belief network based voice activity detection, IEEE Transactions on Audio, Speech, and Language Processing, 21, 4, pp. 691-710, (2013)
  • [6] Ghosh P.K., Tsiartas A., Narayanan S., Robust voice activity detection using long-term signal variability, IEEE Transactions on Audio Speech & Language Processing, 19, 3, pp. 600-613, (2011)
  • [7] Salishev S., Barabanov A., Kocharov D., et al., Voice activity detector (VAD) based on long-term Mel frequency band features, International Conference on Text, Speech, and Dialogue, pp. 352-358, (2016)
  • [8] Zhou Q., Ma L., Zheng Z., et al., Recurrent neural word segmentation with tag inference, (2016)
  • [9] Sak H., Senior A., Rao K., et al., Learning acoustic frame labeling for speech recognition with recurrent neural networks, International Conference on Acoustics, Speech and Signal Processing. Brisbane, Australia: IEEE, pp. 4280-4284, (2015)
  • [10] Zhang X.L., Wang D., Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection, Speech and Signal Processing, pp. 6645-6649, (2014)