MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8
Authors
Kong, Yuxiang [1, 2]
Wu, Jian [1 ]
Wang, Quandong [2 ]
Gao, Peng [2 ]
Zhuang, Weiji [2 ]
Wang, Yujun [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Xiaomi Inc, Beijing, Peoples R China
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
Multi-channel speech recognition; robust speech recognition; deep learning; deep complex unet;
DOI
10.1109/SLT48900.2021.9383492
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. We also investigate the proposed methods with several training strategies to improve recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
Pages: 104-110
Page count: 7