MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8
Authors
Kong, Yuxiang [1, 2]
Wu, Jian [1 ]
Wang, Quandong [2 ]
Gao, Peng [2 ]
Zhuang, Weiji [2 ]
Wang, Yujun [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Xiaomi Inc, Beijing, Peoples R China
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
Multi-channel speech recognition; robust speech recognition; deep learning; deep complex Unet
DOI
10.1109/SLT48900.2021.9383492
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and to integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. We also investigate the proposed methods with several training strategies to improve recognition accuracy on 1,000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
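The abstract reports its gain as a relative CER reduction. As a reading aid, the sketch below shows how such a figure is conventionally computed: the drop in CER divided by the baseline CER. The function name and the input values are hypothetical and chosen only for illustration; they are not results from the paper, which does not give absolute CER values in this record.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative CER reduction of new_cer versus baseline_cer, in percent."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer * 100.0


# Hypothetical CER values (in percent), for illustration only; not from the paper.
baseline = 10.0
proposed = 9.0
reduction = relative_cer_reduction(baseline, proposed)
print(f"{reduction:.1f}% relative CER reduction")
```

A relative reduction is measured against the baseline error rate, so the same percentage corresponds to a smaller absolute change when the baseline CER is already low.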
Pages: 104-110
Number of pages: 7