Deep Belief Networks Based Voice Activity Detection

Cited by: 249
Authors
Zhang, Xiao-Lei [1 ]
Wu, Ji [1 ]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Multimedia Signal & Intelligent Informat Proc Lab, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING | 2013, Vol. 21, No. 4
Funding
China Postdoctoral Science Foundation
Keywords
Deep learning; information fusion; voice activity detection; STATISTICAL-MODEL; MULTIPITCH TRACKING; ALGORITHM; CLASSIFICATION; SEGREGATION; NOISY;
DOI
10.1109/TASL.2012.2229986
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Fusing the advantages of multiple acoustic features is important for the robustness of voice activity detection (VAD). Recently, machine-learning-based VADs have shown superiority over traditional VADs on multiple-feature fusion tasks. However, existing machine-learning-based VADs utilize only shallow models, which cannot explore the underlying manifold of the features. In this paper, we propose to fuse multiple features via a deep model, the deep belief network (DBN). The DBN is a powerful hierarchical generative model for feature extraction; it can describe highly variant functions and discover the manifold of the features. We take multiple serially concatenated features as the input layer of the DBN, extract a new feature by passing them through multiple nonlinear hidden layers, and finally predict the class of the new feature with a linear classifier. We further show that even a belief network with a single hidden layer is as powerful as the state-of-the-art models used in machine-learning-based VADs. In our empirical comparison, ten common features are used for performance analysis. Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven reference VADs but also meets the real-time detection demand of VAD. The results also show that the DBN-based VAD can effectively fuse the advantages of multiple features.
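The abstract describes a concrete pipeline: concatenate several per-frame acoustic features into one vector, transform it through stacked nonlinear hidden layers pretrained as restricted Boltzmann machines, and feed the top-layer representation to a linear classifier. Below is a minimal, hypothetical sketch of that pipeline in Python with scikit-learn; the feature set, layer sizes, and training settings are illustrative placeholders rather than the authors' configuration, and scikit-learn's greedy RBM pretraining omits the supervised fine-tuning of a full DBN.

```python
# Minimal sketch of a DBN-style VAD pipeline, under the assumptions stated
# above. Feature matrices here are random stand-ins, not real acoustics.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Stand-ins for per-frame acoustic features (e.g., MFCC, pitch, energy),
# serially concatenated into one input vector per frame.
n_frames = 2000
mfcc = rng.rand(n_frames, 13)
pitch = rng.rand(n_frames, 1)
energy = rng.rand(n_frames, 1)
X = np.hstack([mfcc, pitch, energy])   # concatenated multi-feature input
y = rng.randint(0, 2, n_frames)        # 1 = speech frame, 0 = non-speech

# Two stacked RBMs play the role of the DBN's nonlinear hidden layers;
# a linear classifier predicts speech/non-speech from the top-layer feature.
dbn_vad = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05,
                          n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05,
                          n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
dbn_vad.fit(X, y)
speech_flags = dbn_vad.predict(X)      # per-frame VAD decisions
```

In practice the concatenated vectors would come from a real acoustic front end, and the trained pipeline would be applied frame by frame to flag speech activity.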
Pages: 697-710
Page count: 14