Towards Scaling Up Classification-Based Speech Separation

Cited by: 376
Authors
Wang, Yuxuan [1 ]
Wang, DeLiang [1 ,2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2013 / Vol. 21 / No. 7
Keywords
Computational auditory scene analysis (CASA); deep belief networks; feature learning; monaural speech separation; support vector machines; NOISE; INTELLIGIBILITY; SEGREGATION; ALGORITHM;
DOI
10.1109/TASL.2013.2250961
CLC number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Formulating speech separation as a binary classification problem has been shown to be effective. While kernel support vector machines (SVMs) achieve good separation performance in matched test conditions, separation in unmatched conditions involving new speakers and environments remains a major challenge. A simple yet effective way to cope with the mismatch is to include many different acoustic conditions in the training set. However, large-scale training is almost intractable for kernel machines due to their computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and then train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.
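The classification view described in the abstract treats each time-frequency (T-F) unit of a noisy spectrogram as a binary decision: speech-dominant (keep) or noise-dominant (discard), yielding a binary mask. The sketch below is a minimal illustration of that second stage only, not the paper's implementation: it skips the DNN feature-learning step and trains a linear SVM by hinge-loss stochastic gradient descent (a Pegasos-style solver) on synthetic two-dimensional "features". All data, cluster positions, and hyperparameters here are made-up assumptions for the demo.

```python
# Sketch of the linear-SVM mask-classification stage (illustrative only).
# Assumption: synthetic 2-D features stand in for DNN-learned features of
# T-F units; speech-dominant units are labeled +1, noise-dominant -1.
import random

random.seed(0)

def make_units(n, label):
    # Two shifted Gaussian clusters as stand-in T-F unit features.
    shift = 1.5 if label == 1 else -1.5
    return [([random.gauss(shift, 1.0), random.gauss(shift, 1.0)], label)
            for _ in range(n)]

def train_linear_svm(data, lam=0.01, epochs=20):
    # Pegasos-style SGD on the regularized hinge loss.
    w, b, t = [0.0, 0.0], 0.0, 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            w = [wi * (1 - eta * lam) for wi in w]  # weight decay
            if margin < 1:                        # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict_mask(w, b, units):
    # Binary mask: 1 keeps the T-F unit, 0 discards it.
    return [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            for x, _ in units]

train = make_units(200, 1) + make_units(200, -1)
w, b = train_linear_svm(train)
test = make_units(50, 1) + make_units(50, -1)
mask = predict_mask(w, b, test)
acc = sum(int(m == (1 if y == 1 else 0))
          for m, (_, y) in zip(mask, test)) / len(test)
print(round(acc, 2))
```

Because the classifier is linear, training cost grows roughly linearly in the number of T-F units, which is what makes training across many acoustic conditions tractable compared with a kernel SVM; the paper's point is that good features (from the DNN) make the linear boundary sufficient.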
Pages: 1381-1390
Page count: 10