Long short-term memory for speaker generalization in supervised speech separation

Cited by: 194
Authors
Chen, Jitong [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
NEURAL-NETWORKS; ALGORITHM; INTELLIGIBILITY; NOISE; MASKS
DOI
10.1121/1.4986931
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation: even without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation. (C) 2017 Acoustical Society of America.
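The abstract's formulation — estimating a time-frequency mask and applying it to the noisy spectrogram — can be sketched as follows. This is a minimal illustrative example using an ideal ratio mask (a common soft training target in this line of work); the function names, array shapes, and the assumption that speech and noise magnitudes add are simplifications of the paper's actual setup, not its implementation.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Soft time-frequency mask in [0, 1]: fraction of local energy
    attributed to speech. Used as a supervised training target."""
    return speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)

def apply_mask(noisy_mag, mask):
    """Element-wise masking of the noisy magnitude spectrogram
    to recover an estimate of the clean-speech magnitudes."""
    return noisy_mag * mask

rng = np.random.default_rng(0)
S = rng.random((64, 100))  # |speech| spectrogram: 64 freq bins x 100 frames
N = rng.random((64, 100))  # |noise| spectrogram, same shape
mask = ideal_ratio_mask(S, N)
est = apply_mask(S + N, mask)  # crude: magnitudes treated as additive
```

In the supervised setting the mask is not computed from the (unknown) clean speech at test time; a network — a DNN or, as proposed here, an LSTM over frame sequences — is trained to predict it from features of the noisy input alone.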
Pages: 4705-4714 (10 pages)