End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Times Cited: 16
Authors
Zhang, Wangyou [1 ,2 ]
Chang, Xuankai [3 ]
Boeddeker, Christoph [4 ]
Nakatani, Tomohiro [5 ]
Watanabe, Shinji [3 ]
Qian, Yanmin [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, X LANCE Lab, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[4] Paderborn Univ, D-33098 Paderborn, Germany
[5] NTT Corp, Kyoto 6190237, Japan
Keywords
Training; Speech recognition; Array signal processing; Speech enhancement; Reverberation; Noise reduction; Feature extraction; End-to-end; dereverberation; beamforming; speech separation; multi-talker speech recognition; SEPARATION; ROBUST; NUMBER; NOISY;
DOI
10.1109/TASLP.2022.3209942
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which either introduces an inter-module mismatch or complicates the training process. In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement frontend (covering speech dereverberation, denoising, and separation) and the ASR backend as a single system. To the best of our knowledge, this is the first work to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. Notably, the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model achieves competitive performance in both noisy and reverberant conditions, with an over 30% relative word error rate (WER) reduction over the single-channel baseline systems.
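The enhancement frontend described in the abstract (WPE-based dereverberation followed by a neural beamformer) can be sketched in the STFT domain as below. This is a minimal illustrative NumPy sketch, not the authors' implementation: it runs a single WPE iteration with uniform weighting, and the time-frequency speech mask is assumed given (in the paper a neural network would predict it, and the whole pipeline would be built in a differentiable framework so ASR gradients can flow through it).

```python
import numpy as np

def wpe_dereverb(X, taps=5, delay=3, eps=1e-8):
    """Simplified per-frequency WPE: predict the late reverberation from
    delayed past frames via least squares and subtract it (one iteration).
    X: complex STFT of shape (channels, frames, freqs). Returns same shape."""
    C, T, F = X.shape
    Y = X.copy()
    for f in range(F):
        x = X[:, :, f]                                   # (C, T)
        # Stack delayed copies of the observation: (C * taps, T)
        pads = [np.roll(x, delay + k, axis=1) for k in range(taps)]
        for k, p in enumerate(pads):
            p[:, :delay + k] = 0.0                       # zero wrapped frames
        Xt = np.concatenate(pads, axis=0)
        # Per-frame power acts as the WPE weighting term
        lam = np.maximum(np.mean(np.abs(x) ** 2, axis=0), eps)   # (T,)
        G = (Xt / lam) @ Xt.conj().T + eps * np.eye(C * taps)
        P = (Xt / lam) @ x.conj().T                      # (C * taps, C)
        W = np.linalg.solve(G, P)                        # prediction filter
        Y[:, :, f] = x - W.conj().T @ Xt                 # remove late reverb
    return Y

def mvdr_beamform(X, speech_mask, eps=1e-8):
    """Mask-based MVDR beamformer per frequency bin.
    X: (channels, frames, freqs); speech_mask: (frames, freqs) in [0, 1]."""
    C, T, F = X.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        x = X[:, :, f]                                   # (C, T)
        ms, mn = speech_mask[:, f], 1.0 - speech_mask[:, f]
        Rs = (x * ms) @ x.conj().T / max(ms.sum(), eps)  # speech covariance
        Rn = (x * mn) @ x.conj().T / max(mn.sum(), eps)  # noise covariance
        Rn += eps * np.eye(C)
        # Steering vector: principal eigenvector of the speech covariance
        v = np.linalg.eigh(Rs)[1][:, -1]
        w = np.linalg.solve(Rn, v)
        w /= (v.conj() @ w + eps)                        # MVDR normalization
        out[:, f] = w.conj() @ x                         # enhanced STFT
    return out
```

In the paper's multi-speaker setting, one mask (and hence one beamformer output) would be produced per speaker, and each enhanced stream would be fed to the shared E2E ASR backend; the joint loss is simply the ASR criterion summed over speakers.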
Pages: 3173-3188
Page count: 16