Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

Cited by: 4
Authors
Masuyama, Yoshiki [1 ,2 ]
Togami, Masahito [2 ]
Komatsu, Tatsuya [2 ]
Affiliations
[1] Waseda Univ, Dept Intermedia Art & Sci, Tokyo, Japan
[2] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
Speaker-independent multi-talker separation; neural beamformer; multichannel Itakura-Saito divergence
DOI
10.21437/Interspeech.2019-1289
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
In this paper, we propose two mask-based beamforming methods that use a deep neural network (DNN) trained with multichannel loss functions. Beamforming with time-frequency (TF) masks estimated by a DNN has been applied to many tasks, where the TF masks are used to estimate spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speech enhancement/separation have been employed. Although such a training criterion is simple, it does not directly correspond to the performance of the mask-based beamformer. To overcome this problem, we use multichannel loss functions that evaluate the estimated spatial covariance matrices based on the multichannel Itakura-Saito divergence. DNNs trained with the multichannel loss functions can be used to construct several types of beamformers. Experimental results confirmed their effectiveness and robustness to microphone configurations.
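To make the pipeline in the abstract concrete, below is a minimal NumPy sketch of mask-based spatial covariance estimation, MVDR beamforming, and the multichannel Itakura-Saito divergence the loss is based on. The function names, the eigenvector-based steering vector, and all shapes are illustrative assumptions, not the authors' implementation; in the paper the divergence is backpropagated through the DNN during training, which is omitted here.

```python
import numpy as np


def mask_based_scm(stft, mask, eps=1e-8):
    """Estimate a spatial covariance matrix (SCM) per frequency bin by
    weighting outer products of the multichannel STFT with a TF mask.

    stft: complex array of shape (F, T, M); mask: array of shape (F, T).
    Returns a complex array of shape (F, M, M).
    """
    num = np.einsum("ft,ftm,ftn->fmn", mask, stft, stft.conj())
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den


def mvdr_weights(scm_speech, scm_noise, eps=1e-8):
    """MVDR beamformer weights per frequency bin, taking the principal
    eigenvector of the speech SCM as the steering vector (one common
    choice; the paper constructs several beamformers)."""
    n_freq, n_mic, _ = scm_speech.shape
    weights = np.zeros((n_freq, n_mic), dtype=complex)
    for f in range(n_freq):
        _, vecs = np.linalg.eigh(scm_speech[f])
        steer = vecs[:, -1]  # principal eigenvector (largest eigenvalue)
        num = np.linalg.solve(scm_noise[f] + eps * np.eye(n_mic), steer)
        weights[f] = num / (steer.conj() @ num + eps)
    return weights


def multichannel_is_divergence(scm_ref, scm_est, eps=1e-8):
    """Multichannel Itakura-Saito divergence, summed over frequency bins:
    tr(R R_hat^{-1}) - log det(R R_hat^{-1}) - M."""
    n_mic = scm_ref.shape[-1]
    prod = scm_ref @ np.linalg.inv(scm_est + eps * np.eye(n_mic))
    trace = np.trace(prod, axis1=-2, axis2=-1).real
    _, logdet = np.linalg.slogdet(prod)
    return float(np.sum(trace - logdet - n_mic))


# Usage sketch: X is an (F, T, M) mixture STFT; mask_s and mask_n are
# hypothetical DNN-estimated speech/noise masks of shape (F, T).
# scm_s = mask_based_scm(X, mask_s)
# scm_n = mask_based_scm(X, mask_n)
# y = np.einsum("fm,ftm->ft", mvdr_weights(scm_s, scm_n).conj(), X)
```

Because the divergence is computed between full spatial covariance matrices rather than monaural magnitudes, it penalizes errors in the spatial statistics that the beamformer actually consumes, which is the paper's motivation for replacing monaural training criteria.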
Pages: 2708-2712
Page count: 5