CONVOLUTIVE TRANSFER FUNCTION INVARIANT SDR TRAINING CRITERIA FOR MULTI-CHANNEL REVERBERANT SPEECH SEPARATION

Cited by: 18
Authors
Boeddeker, Christoph [1]
Zhang, Wangyou [2]
Nakatani, Tomohiro [3]
Kinoshita, Keisuke [3]
Ochiai, Tsubasa [3]
Delcroix, Marc [3]
Kamo, Naoyuki [3]
Qian, Yanmin [2]
Haeb-Umbach, Reinhold [1]
Affiliations
[1] Paderborn Univ, Dept Commun Engn, Paderborn, Germany
[2] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai, Peoples R China
[3] NTT Corp, Tokyo, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Multi-channel source separation; acoustic beamforming; complex backpropagation; Signal-to-Distortion Ratio; NETWORKS
DOI
10.1109/ICASSP39728.2021.9414661
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural-network-supported multi-channel source separation with a time-domain training objective function. For the objective, we propose to use a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To show the effectiveness, we demonstrate the performance on LibriSpeech-based reverberant mixtures. On this task, the proposed system approaches the error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test clean, with a difference of only 1.2 percentage points, thus outperforming a conventional permutation invariant training based system and alternative objectives like Scale-Invariant Signal-to-Distortion Ratio by a large margin.
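The difference between the scale-invariant and the convolutive transfer function invariant SDR mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the filter length of 16 taps, the truncation to the reference length, and the test signal are all assumptions made for illustration. The scale-invariant variant lets the reference be rescaled by a single scalar, while the convolutive variant lets it be filtered by a short FIR filter, fitted here by least squares, before the distortion is measured.

```python
import numpy as np


def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: the reference may be rescaled by one scalar."""
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    distortion = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))


def ci_sdr(reference, estimate, filter_length=16):
    """SDR in dB where the reference may be filtered by a short FIR filter.

    filter_length is an illustrative assumption, not a value from the paper.
    """
    n = len(reference)
    # Column k holds the reference delayed by k samples (zero-padded at the start).
    delayed = np.zeros((n, filter_length))
    for k in range(filter_length):
        delayed[k:, k] = reference[:n - k]
    # Least-squares FIR filter that maps the reference onto the estimate.
    coeffs, *_ = np.linalg.lstsq(delayed, estimate, rcond=None)
    target = delayed @ coeffs
    distortion = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))


# Hypothetical test signal: white noise passed through a short filter,
# plus a small amount of additive noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal(2000)
estimate = np.convolve(reference, [1.0, 0.5, 0.25])[:2000]
estimate = estimate + 0.01 * rng.standard_normal(2000)

print("SI-SDR [dB]:", si_sdr(reference, estimate))
print("CI-SDR [dB]:", ci_sdr(reference, estimate))
```

In this toy case the short filter counts as distortion for the scale-invariant measure but is absorbed into the target by the convolutive measure, so the second value comes out much higher. A training loss would typically use the negative of such a value, computed with a differentiable framework rather than NumPy.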
Pages: 8428-8432
Page count: 5