MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8

Authors:

Kong, Yuxiang [1,2]

Wu, Jian [1]

Wang, Quandong [2]

Gao, Peng [2]

Zhuang, Weiji [2]

Wang, Yujun [2]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China

[2] Xiaomi Inc, Beijing, Peoples R China

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

Multi-channel speech recognition; robust speech recognition; deep learning; deep complex Unet

DOI:

10.1109/SLT48900.2021.9383492

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of the deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and to integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. We also investigate the proposed methods with several training strategies to improve recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that the proposed DCUnet-MTL method yields a relative character error rate (CER) reduction of about 12.2% compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.

Citation

Pages: 104-110

Page count: 7

References: 31 in total

[21] Ronneberger, Olaf; Fischer, Philipp; Brox, Thomas. U-Net: Convolutional Networks for Biomedical Image Segmentation [J]. MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION, PT III, 2015, 9351: 234-241
[22] Sainath, Tara N.; Weiss, Ron J.; Wilson, Kevin W.; Li, Bo; Narayanan, Arun; Variani, Ehsan; Bacchiani, Michiel; Shafran, Izhak; Senior, Andrew; Chin, Kean; Misra, Ananya; Kim, Chanwoo. Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2017, 25 (05): 965-979
[23] Seltzer ML, 2008, 2008 HANDS-FREE SPEECH COMMUNICATION AND MICROPHONE ARRAYS, P105
[24] Tan K, 2020, IEEE-ACM T AUDIO SPE, V28, P380, DOI 10.1109/TASLP.2019.2955276
[25] Vaswani A, 2017, ADV NEUR IN, V30
[26] Williamson, Donald S.; Wang, Yuxuan; Wang, DeLiang. Complex Ratio Masking for Monaural Speech Separation [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (03): 483-492
[27] Wu J, 2019, 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), P667, DOI 10.1109/ASRU46091.2019.9003983
[28] Xiao X, 2016, INT CONF ACOUST SPEE, P5745, DOI 10.1109/ICASSP.2016.7472778
[29] Yin D., 2019, ARXIV191104697
[30] Zhang, Yu; Zhang, Pengyuan; Yan, Yonghong. Attention-based LSTM with Multi-task Learning for Distant Speech Recognition [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 3857-3861
