End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Times Cited: 16
Authors
Zhang, Wangyou [1 ,2 ]
Chang, Xuankai [3 ]
Boeddeker, Christoph [4 ]
Nakatani, Tomohiro [5 ]
Watanabe, Shinji [3 ]
Qian, Yanmin [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, X LANCE Lab, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[4] Paderborn Univ, D-33098 Paderborn, Germany
[5] NTT Corp, Kyoto 6190237, Japan
Keywords
Training; Speech recognition; Array signal processing; Speech enhancement; Reverberation; Noise reduction; Feature extraction; End-to-end; dereverberation; beamforming; speech separation; multi-talker speech recognition; SEPARATION; ROBUST; NUMBER; NOISY;
DOI
10.1109/TASLP.2022.3209942
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which either introduces an inter-module mismatch or complicates the training process. In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement frontend (covering speech dereverberation, denoising, and separation) and the ASR backend as a single system. To the best of our knowledge, this is the first work to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. Notably, the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model achieves competitive performance in both noisy and reverberant conditions, with an over 30% relative word error rate (WER) reduction over the single-channel baseline systems.
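The enhancement frontend described in the abstract (WPE-based dereverberation followed by a neural beamformer) can be sketched in the STFT domain as below. This is a minimal illustrative NumPy sketch, not the authors' implementation: it runs a single WPE iteration with uniform weighting, and the time-frequency speech mask is assumed given (in the paper a neural network would predict it, and the whole pipeline would be built in a differentiable framework so ASR gradients can flow through it).

```python
import numpy as np

def wpe_dereverb(X, taps=5, delay=3, eps=1e-8):
    """Simplified per-frequency WPE: predict the late reverberation from
    delayed past frames via least squares and subtract it (one iteration).
    X: complex STFT of shape (channels, frames, freqs). Returns same shape."""
    C, T, F = X.shape
    Y = X.copy()
    for f in range(F):
        x = X[:, :, f]                                   # (C, T)
        # Stack delayed copies of the observation: (C * taps, T)
        pads = [np.roll(x, delay + k, axis=1) for k in range(taps)]
        for k, p in enumerate(pads):
            p[:, :delay + k] = 0.0                       # zero wrapped frames
        Xt = np.concatenate(pads, axis=0)
        # Per-frame power acts as the WPE weighting term
        lam = np.maximum(np.mean(np.abs(x) ** 2, axis=0), eps)   # (T,)
        G = (Xt / lam) @ Xt.conj().T + eps * np.eye(C * taps)
        P = (Xt / lam) @ x.conj().T                      # (C * taps, C)
        W = np.linalg.solve(G, P)                        # prediction filter
        Y[:, :, f] = x - W.conj().T @ Xt                 # remove late reverb
    return Y

def mvdr_beamform(X, speech_mask, eps=1e-8):
    """Mask-based MVDR beamformer per frequency bin.
    X: (channels, frames, freqs); speech_mask: (frames, freqs) in [0, 1]."""
    C, T, F = X.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        x = X[:, :, f]                                   # (C, T)
        ms, mn = speech_mask[:, f], 1.0 - speech_mask[:, f]
        Rs = (x * ms) @ x.conj().T / max(ms.sum(), eps)  # speech covariance
        Rn = (x * mn) @ x.conj().T / max(mn.sum(), eps)  # noise covariance
        Rn += eps * np.eye(C)
        # Steering vector: principal eigenvector of the speech covariance
        v = np.linalg.eigh(Rs)[1][:, -1]
        w = np.linalg.solve(Rn, v)
        w /= (v.conj() @ w + eps)                        # MVDR normalization
        out[:, f] = w.conj() @ x                         # enhanced STFT
    return out
```

In the paper's multi-speaker setting, one mask (and hence one beamformer output) would be produced per speaker, and each enhanced stream would be fed to the shared E2E ASR backend; the joint loss is simply the ASR criterion summed over speakers.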
Pages: 3173-3188
Page count: 16