Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation

Cited by: 23
Authors
Chen, Jitong [1 ]
Wang, DeLiang [1 ,2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
speech separation; speaker generalization; long short-term memory; noise; algorithm
DOI
10.21437/Interspeech.2016-551
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Speech separation can be formulated as a supervised learning problem where a time-frequency mask is estimated by a learning machine from acoustic features of noisy speech. Deep neural networks (DNNs) have been successful for noise generalization in supervised separation. However, real-world applications require a trained model to perform well with both unseen speakers and unseen noises. In this study we investigate speaker generalization for noise-independent models and propose a separation model based on long short-term memory to account for the temporal dynamics of speech. Our experiments show that the proposed model significantly outperforms a DNN in terms of objective speech intelligibility for both seen and unseen speakers. Compared to feedforward networks, the proposed model is more capable of modeling a large number of speakers, and represents an effective approach for speaker- and noise-independent speech separation.
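The abstract describes estimating a time-frequency mask from noisy-speech features with an LSTM run over time frames. A minimal single-unit sketch of that idea follows; the weights, the one-feature-per-frame input, and the single LSTM cell are illustrative simplifications, not the authors' architecture. The sigmoid output layer keeps each estimated mask value in [0, 1], as a soft time-frequency mask requires.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    # One LSTM cell update. w maps each gate name to (w_x, w_h, bias).
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])  # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])  # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])  # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])  # candidate
    c = f * c + i * g          # cell state carries temporal context
    h = o * math.tanh(c)       # hidden state exposed to the output layer
    return h, c

def estimate_mask(features, w, w_out):
    # Run the LSTM over frames; emit one sigmoid mask value per frame.
    h, c = 0.0, 0.0
    mask = []
    for x in features:
        h, c = lstm_step(x, h, c, w)
        mask.append(sigmoid(w_out[0] * h + w_out[1]))
    return mask
```

Because the cell state persists across frames, each mask estimate depends on preceding context, which is the temporal-dynamics advantage over a frame-by-frame feedforward network that the abstract highlights.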
Pages: 3314-3318 (5 pages)