Long short-term memory for speaker generalization in supervised speech separation

Cited by: 187
Authors
Chen, Jitong [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
NEURAL-NETWORKS; ALGORITHM; INTELLIGIBILITY; NOISE; MASKS;
DOI
10.1121/1.4986931
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403
Abstract
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for the temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation: even without future frames, it performs better than the DNN model that uses future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation. (C) 2017 Acoustical Society of America.
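
For illustration, the mask-estimation formulation described in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' implementation; the feature dimension, hidden size, layer count, and the sigmoid (ratio-mask-style) output are assumptions made for the example.

    # Hypothetical sketch: an LSTM that maps noisy-speech acoustic features to a
    # time-frequency mask. Sizes and mask type are assumptions, not the paper's setup.
    import torch
    import torch.nn as nn

    class LSTMMaskEstimator(nn.Module):
        def __init__(self, feature_dim=64, hidden_dim=512, num_layers=2, num_freq_bins=64):
            super().__init__()
            # Unidirectional LSTM, so the model can run with low latency (no future frames).
            self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers, batch_first=True)
            self.output = nn.Linear(hidden_dim, num_freq_bins)

        def forward(self, features):
            # features: (batch, time, feature_dim) acoustic features from noisy speech
            hidden, _ = self.lstm(features)
            # Sigmoid bounds each time-frequency mask value to [0, 1].
            return torch.sigmoid(self.output(hidden))

    # Usage: estimate masks for a batch of 100-frame feature sequences.
    model = LSTMMaskEstimator()
    noisy_features = torch.randn(8, 100, 64)
    mask = model(noisy_features)  # shape (8, 100, 64), values in [0, 1]

The recurrent state lets the model carry long-term speech context across frames, which is the property the abstract credits for improved speaker generalization over a frame-window DNN.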
Pages: 4705-4714
Number of pages: 10