MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8

Authors:

Kong, Yuxiang [1,2]

Wu, Jian [1]

Wang, Quandong [2]

Gao, Peng [2]

Zhuang, Weiji [2]

Wang, Yujun [2]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China

[2] Xiaomi Inc, Beijing, Peoples R China

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

Multi-channel speech recognition; robust speech recognition; deep learning; deep complex unet;

DOI:

10.1109/SLT48900.2021.9383492

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and integrate the two in a multi-task learning (MTL) framework, alongside a cascaded framework for comparison. We also investigate the proposed methods with several training strategies to improve recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method yields about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
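The abstract reports its result as a relative CER reduction. As a reading aid, the sketch below shows how CER and a relative reduction are commonly computed. It is a minimal illustration, not the authors' scoring code: the function names are hypothetical, and the numbers in the usage lines are made up for the example and are not results from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative error reduction of new_cer with respect to baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer


# Illustrative values only (not taken from the paper).
print(cer("abcd", "abed"))             # one substitution over four characters
print(relative_reduction(20.0, 17.0))  # fraction of the baseline error removed
```

A relative reduction is measured against the baseline error, so the same absolute CER gain corresponds to a larger relative figure when the baseline CER is lower.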

Pages: 104-110

Number of pages: 7
