Towards Scaling Up Classification-Based Speech Separation

Cited by: 376
Authors
Wang, Yuxuan [1 ]
Wang, DeLiang [1 ,2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2013 / Vol. 21 / No. 7
Keywords
Computational auditory scene analysis (CASA); deep belief networks; feature learning; monaural speech separation; support vector machines; NOISE; INTELLIGIBILITY; SEGREGATION; ALGORITHM;
DOI
10.1109/TASL.2013.2250961
CLC number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Formulating speech separation as a binary classification problem has been shown to be effective. While kernel support vector machines (SVMs) achieve good separation performance in matched test conditions, separation in unmatched conditions involving new speakers and environments remains a major challenge. A simple yet effective way to cope with the mismatch is to include many different acoustic conditions in the training set. However, large-scale training is almost intractable for kernel machines due to their computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and then train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.
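The classification view described in the abstract treats each time-frequency (T-F) unit of a noisy spectrogram as a binary decision: speech-dominant (keep) or noise-dominant (discard), yielding a binary mask. The sketch below is a minimal illustration of that second stage only, not the paper's implementation: it skips the DNN feature-learning step and trains a linear SVM by hinge-loss stochastic gradient descent (a Pegasos-style solver) on synthetic two-dimensional "features". All data, cluster positions, and hyperparameters here are made-up assumptions for the demo.

```python
# Sketch of the linear-SVM mask-classification stage (illustrative only).
# Assumption: synthetic 2-D features stand in for DNN-learned features of
# T-F units; speech-dominant units are labeled +1, noise-dominant -1.
import random

random.seed(0)

def make_units(n, label):
    # Two shifted Gaussian clusters as stand-in T-F unit features.
    shift = 1.5 if label == 1 else -1.5
    return [([random.gauss(shift, 1.0), random.gauss(shift, 1.0)], label)
            for _ in range(n)]

def train_linear_svm(data, lam=0.01, epochs=20):
    # Pegasos-style SGD on the regularized hinge loss.
    w, b, t = [0.0, 0.0], 0.0, 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            w = [wi * (1 - eta * lam) for wi in w]  # weight decay
            if margin < 1:                        # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict_mask(w, b, units):
    # Binary mask: 1 keeps the T-F unit, 0 discards it.
    return [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            for x, _ in units]

train = make_units(200, 1) + make_units(200, -1)
w, b = train_linear_svm(train)
test = make_units(50, 1) + make_units(50, -1)
mask = predict_mask(w, b, test)
acc = sum(int(m == (1 if y == 1 else 0))
          for m, (_, y) in zip(mask, test)) / len(test)
print(round(acc, 2))
```

Because the classifier is linear, training cost grows roughly linearly in the number of T-F units, which is what makes training across many acoustic conditions tractable compared with a kernel SVM; the paper's point is that good features (from the DNN) make the linear boundary sufficient.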
Pages: 1381-1390
Page count: 10