An iterative mask estimation approach to deep learning based multi-channel speech recognition

Cited by: 19
Authors
Tu, Yan-Hui [1 ]
Du, Jun [1 ]
Sun, Lei [1 ]
Ma, Feng [2 ]
Wang, Hai-Kun [2 ]
Chen, Jing-Dong [3 ]
Lee, Chin-Hui [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] iFlytek Co Ltd, Hefei, Anhui, Peoples R China
[3] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[4] Georgia Inst Technol, Atlanta, GA 30332 USA
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
CHiME challenge; Deep learning; Ideal ratio mask (IRM); Microphone array; Robust speech recognition; BLIND SOURCE SEPARATION; ENHANCEMENT; NOISE;
DOI
10.1016/j.specom.2018.11.005
Chinese Library Classification
O42 [Acoustics];
Subject Classification Code
070206 ; 082403 ;
Abstract
We propose a novel iterative mask estimation (IME) framework to improve the state-of-the-art complex Gaussian mixture model (CGMM)-based beamforming approach in an iterative manner by leveraging the complementary information obtained from different deep models. Although CGMM has recently been demonstrated to be quite effective for multi-channel automatic speech recognition (ASR) in operational scenarios, the corresponding mask estimation is not always accurate in adverse environments due to the lack of prior or context information. To address this problem, a neural-network-based ideal ratio mask (IRM) estimator learned from a multi-condition data set is first adopted to incorporate prior information, obtained from speech/noise interactions and long acoustic context, into the CGMM-based beamformed speech, which has a higher signal-to-noise ratio (SNR) than the original noisy speech signal. Next, to further exploit the rich context information in deep acoustic and language models, voice activity detection (VAD) information derived from the speech recognition results is used to refine the mask estimate, yielding a significant reduction in insertion errors. On the recently launched CHiME-4 Challenge ASR task of recognizing 6-channel microphone array speech, the proposed IME approach significantly and consistently outperforms the CGMM approach under different configurations, with relative word error rate reductions ranging from 20% to 30%. Furthermore, the IME approach plays a key role in the ensemble system that achieves the best performance in the CHiME-4 Challenge.
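The abstract describes a pipeline that alternates mask-based beamforming with network- and VAD-guided mask refinement. The following is a minimal, hedged sketch of that loop, not the authors' implementation: it assumes a precomputed CGMM speech mask, a hypothetical IRM-estimating callable `irm_net`, and a per-frame VAD vector `vad` (all inputs in STFT domain), and uses a standard mask-based MVDR beamformer as the enhancement step.

```python
import numpy as np

def mask_to_psd(obs, mask):
    """Estimate a spatial covariance (PSD) matrix from a time-frequency mask.
    obs:  (F, T, C) complex multi-channel STFT
    mask: (F, T) mask in [0, 1] for the masked source (speech or noise)
    """
    outer = np.einsum('ftc,ftd->ftcd', obs, obs.conj())          # per-bin outer products
    weighted = mask[..., None, None] * outer
    return weighted.sum(axis=1) / np.maximum(mask.sum(axis=1)[:, None, None], 1e-8)

def mvdr_weights(psd_speech, psd_noise, ref_channel=0):
    """Per-frequency MVDR beamformer weights, shape (F, C)."""
    F, C, _ = psd_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        num = np.linalg.solve(psd_noise[f] + 1e-6 * np.eye(C), psd_speech[f])
        w[f] = (num / np.maximum(np.trace(num).real, 1e-8))[:, ref_channel]
    return w

def iterative_mask_estimation(obs, cgmm_mask, irm_net, vad, n_iter=2):
    """Hedged sketch of the IME loop (assumed interfaces, not the paper's code):
    start from the CGMM speech mask, beamform, re-estimate an IRM on the
    enhanced (higher-SNR) signal with a network, and gate it with VAD."""
    mask = cgmm_mask                                   # (F, T) initial speech mask
    for _ in range(n_iter):
        psd_s = mask_to_psd(obs, mask)                 # speech spatial covariance
        psd_n = mask_to_psd(obs, 1.0 - mask)           # noise spatial covariance
        w = mvdr_weights(psd_s, psd_n)
        enhanced = np.einsum('fc,ftc->ft', w.conj(), obs)   # beamformed STFT (F, T)
        irm = irm_net(np.abs(enhanced))                # assumed callable: (F, T) IRM
        mask = irm * vad[None, :]                      # zero non-speech frames (assumed (T,) VAD)
    return mask, enhanced
```

The VAD gating mirrors the role the abstract attributes to recognition-derived VAD information: frames decoded as silence contribute no speech energy to the refined mask, which is what reduces insertion errors.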
Pages: 31 - 43
Number of pages: 13
Related Papers
50 records in total
  • [1] A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-channel Speech Recognition in the CHiME-6 Challenge
    Tu, Yan-Hui
    Du, Jun
    Sun, Lei
    Ma, Feng
    Pan, Jia
    Lee, Chin-Hui
    INTERSPEECH 2020, 2020, : 96 - 100
  • [2] An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech
    Tu, Yan-Hui
    Du, Jun
    Wang, Qing
    Bao, Xiao
    Dai, Li-Rong
    Lee, Chin-Hui
    COMPUTER SPEECH AND LANGUAGE, 2017, 46 : 517 - 534
  • [3] On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones
    Tu, Yan-Hui
    Du, Jun
    Sun, Lei
    Ma, Feng
    Lee, Chin-Hui
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 394 - 398
  • [4] MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET
    Kong, Yuxiang
    Wu, Jian
    Wang, Quandong
    Gao, Peng
    Zhuang, Weiji
    Wang, Yujun
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 104 - 110
  • [5] Acoustic Model Combination Incorporated With Mask-Based Multi-Channel Source Separation for Automatic Speech Recognition
    Yoon, Jae Sam
    Park, Ji Hun
    Kim, Hong Kook
    Kim, Hoirin
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (05) : 772 - 784
  • [6] A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
    Zhang, Hao
    Wang, DeLiang
    INTERSPEECH 2021, 2021, : 1139 - 1143
  • [7] DMANET: DEEP LEARNING-BASED DIFFERENTIAL MICROPHONE ARRAYS FOR MULTI-CHANNEL SPEECH SEPARATION
    Yang, Xiaokang
    Wei, Jianguo
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4363 - 4367
  • [8] Deep MCANC: A deep learning approach to multi-channel active noise control
    Zhang, Hao
    Wang, DeLiang
    NEURAL NETWORKS, 2023, 158 : 318 - 327
  • [9] Speech distortion weighted multi-channel Wiener filter and its application to speech recognition
    Kim, Gibak
    IEICE ELECTRONICS EXPRESS, 2015, 12 (06): : 1 - 7
  • [10] Multi-channel sub-band speech recognition
    McCowan, I. A.
    Sridharan, S.
    EURASIP Journal on Advances in Signal Processing, 2001 (1) : 45 - 52