Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation

Cited by: 23
Authors
Chen, Jitong [1 ]
Wang, DeLiang [1 ,2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
speech separation; speaker generalization; long short-term memory; noise; algorithm
DOI
10.21437/Interspeech.2016-551
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Speech separation can be formulated as a supervised learning problem where a time-frequency mask is estimated by a learning machine from acoustic features of noisy speech. Deep neural networks (DNNs) have been successful for noise generalization in supervised separation. However, real-world applications require a trained model to perform well with both unseen speakers and unseen noises. In this study we investigate speaker generalization for noise-independent models and propose a separation model based on long short-term memory to account for the temporal dynamics of speech. Our experiments show that the proposed model significantly outperforms a DNN in terms of objective speech intelligibility for both seen and unseen speakers. Compared to feedforward networks, the proposed model is more capable of modeling a large number of speakers, and represents an effective approach for speaker- and noise-independent speech separation.
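The abstract describes estimating a time-frequency mask from noisy-speech features with an LSTM run over time frames. A minimal single-unit sketch of that idea follows; the weights, the one-feature-per-frame input, and the single LSTM cell are illustrative simplifications, not the authors' architecture. The sigmoid output layer keeps each estimated mask value in [0, 1], as a soft time-frequency mask requires.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    # One LSTM cell update. w maps each gate name to (w_x, w_h, bias).
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])  # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])  # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])  # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])  # candidate
    c = f * c + i * g          # cell state carries temporal context
    h = o * math.tanh(c)       # hidden state exposed to the output layer
    return h, c

def estimate_mask(features, w, w_out):
    # Run the LSTM over frames; emit one sigmoid mask value per frame.
    h, c = 0.0, 0.0
    mask = []
    for x in features:
        h, c = lstm_step(x, h, c, w)
        mask.append(sigmoid(w_out[0] * h + w_out[1]))
    return mask
```

Because the cell state persists across frames, each mask estimate depends on preceding context, which is the temporal-dynamics advantage over a frame-by-frame feedforward network that the abstract highlights.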
Pages: 3314-3318 (5 pages)