Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

Cited by: 4
Authors
Masuyama, Yoshiki [1 ,2 ]
Togami, Masahito [2 ]
Komatsu, Tatsuya [2 ]
Affiliations
[1] Waseda Univ, Dept Intermedia Art & Sci, Tokyo, Japan
[2] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
Speaker-independent multi-talker separation; neural beamformer; multichannel Itakura-Saito divergence
DOI
10.21437/Interspeech.2019-1289
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
In this paper, we propose two mask-based beamforming methods that use a deep neural network (DNN) trained with multichannel loss functions. Beamforming with time-frequency (TF) masks estimated by a DNN has been applied to many tasks, where the TF masks are used to estimate spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speech enhancement/separation have been employed. Although such a training criterion is simple, it does not directly correspond to the performance of the mask-based beamformer. To overcome this problem, we use multichannel loss functions that evaluate the estimated spatial covariance matrices based on the multichannel Itakura-Saito divergence. DNNs trained with the multichannel loss functions can be used to construct several types of beamformers. Experimental results confirmed their effectiveness and robustness to microphone configurations.
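To make the pipeline in the abstract concrete, below is a minimal NumPy sketch of mask-based spatial covariance estimation, MVDR beamforming, and the multichannel Itakura-Saito divergence the loss is based on. The function names, the eigenvector-based steering vector, and all shapes are illustrative assumptions, not the authors' implementation; in the paper the divergence is backpropagated through the DNN during training, which is omitted here.

```python
import numpy as np


def mask_based_scm(stft, mask, eps=1e-8):
    """Estimate a spatial covariance matrix (SCM) per frequency bin by
    weighting outer products of the multichannel STFT with a TF mask.

    stft: complex array of shape (F, T, M); mask: array of shape (F, T).
    Returns a complex array of shape (F, M, M).
    """
    num = np.einsum("ft,ftm,ftn->fmn", mask, stft, stft.conj())
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den


def mvdr_weights(scm_speech, scm_noise, eps=1e-8):
    """MVDR beamformer weights per frequency bin, taking the principal
    eigenvector of the speech SCM as the steering vector (one common
    choice; the paper constructs several beamformers)."""
    n_freq, n_mic, _ = scm_speech.shape
    weights = np.zeros((n_freq, n_mic), dtype=complex)
    for f in range(n_freq):
        _, vecs = np.linalg.eigh(scm_speech[f])
        steer = vecs[:, -1]  # principal eigenvector (largest eigenvalue)
        num = np.linalg.solve(scm_noise[f] + eps * np.eye(n_mic), steer)
        weights[f] = num / (steer.conj() @ num + eps)
    return weights


def multichannel_is_divergence(scm_ref, scm_est, eps=1e-8):
    """Multichannel Itakura-Saito divergence, summed over frequency bins:
    tr(R R_hat^{-1}) - log det(R R_hat^{-1}) - M."""
    n_mic = scm_ref.shape[-1]
    prod = scm_ref @ np.linalg.inv(scm_est + eps * np.eye(n_mic))
    trace = np.trace(prod, axis1=-2, axis2=-1).real
    _, logdet = np.linalg.slogdet(prod)
    return float(np.sum(trace - logdet - n_mic))


# Usage sketch: X is an (F, T, M) mixture STFT; mask_s and mask_n are
# hypothetical DNN-estimated speech/noise masks of shape (F, T).
# scm_s = mask_based_scm(X, mask_s)
# scm_n = mask_based_scm(X, mask_n)
# y = np.einsum("fm,ftm->ft", mvdr_weights(scm_s, scm_n).conj(), X)
```

Because the divergence is computed between full spatial covariance matrices rather than monaural magnitudes, it penalizes errors in the spatial statistics that the beamformer actually consumes, which is the paper's motivation for replacing monaural training criteria.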
Pages: 2708-2712
Page count: 5