Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

Cited by: 8
Authors
Gong, Rong [1 ]
Quillen, Carl [2 ]
Sharma, Dushyant [2 ]
Goderre, Andrew [2 ]
Lainez, Jose [3 ]
Milanovic, Ljubomir [1 ]
Affiliations
[1] Nuance Commun GmbH, Vienna, Austria
[2] Nuance Commun Inc, Burlington, MA USA
[3] Nuance Commun SA, Madrid, Spain
Source
INTERSPEECH 2021, 2021
Keywords
speech recognition; multichannel; self-attention; ASR frontend; channel combination; far-field; end-to-end
DOI
10.21437/Interspeech.2021-1190
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
When sufficiently large far-field training data are available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
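To make the channel-combination idea in the abstract concrete, the following is a minimal PyTorch sketch of a self-attention combinator operating on per-channel magnitude spectra. It is an illustrative assumption, not the published SACC architecture: the class name, embedding size, log compression, the averaging over the query-channel axis, and the softmax over channels are all choices made here for demonstration only.

```python
# Hypothetical sketch of a self-attention channel combinator-style frontend.
# NOT the authors' exact SACC design: projection sizes, log compression, and
# the softmax over channels are assumptions made for illustration.
import torch
import torch.nn as nn


class SelfAttentionChannelCombinator(nn.Module):
    """Combines M microphone channels of magnitude spectra into one channel."""

    def __init__(self, num_freq_bins: int, embed_dim: int = 128):
        super().__init__()
        # Per-channel projections of the (log-)magnitude spectrum to query/key space.
        self.query = nn.Linear(num_freq_bins, embed_dim)
        self.key = nn.Linear(num_freq_bins, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, channels, frames, freq) magnitude spectrogram per channel.
        x = torch.log1p(mag)                     # assumed compression, for illustration
        q = self.query(x)                        # (B, M, T, D)
        k = self.key(x)                          # (B, M, T, D)
        # Channel-to-channel attention scores, computed independently per frame.
        scores = torch.einsum('bmtd,bntd->btmn', q, k) * self.scale   # (B, T, M, M)
        # Reduce over the query-channel axis and softmax over channels, giving one
        # combination weight per channel and frame (an assumed reduction).
        weights = torch.softmax(scores.mean(dim=2), dim=-1)           # (B, T, M)
        # Weighted sum of channel magnitudes -> single-channel spectrogram.
        return torch.einsum('btm,bmtf->btf', weights, mag)            # (B, T, F)


if __name__ == "__main__":
    B, M, T, F = 2, 4, 100, 257                  # e.g. a 4-mic array, 257 STFT bins
    mags = torch.rand(B, M, T, F)
    frontend = SelfAttentionChannelCombinator(num_freq_bins=F)
    print(frontend(mags).shape)                  # torch.Size([2, 100, 257])
```

The combined single-channel spectrogram would then feed the ASR backend (a ContextNet-based model in the paper), so the combination weights can be learned jointly with the recognizer.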
Pages: 3840-3844
Number of pages: 5
Related Papers
50 records in total
  • [1] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399
  • [2] Curriculum Learning based approaches for robust end-to-end far-field speech recognition
    Ranjan, Shivesh
    Hansen, John H. L.
    SPEECH COMMUNICATION, 2021, 132 : 123 - 131
  • [3] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [4] SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
    Gao, Zhifu
    Zhang, Shiliang
    Lei, Ming
    McLoughlin, Ian
    INTERSPEECH 2020, 2020, : 6 - 10
  • [5] Efficient decoding self-attention for end-to-end speech synthesis
    Zhao, Wei
    Xu, Li
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07) : 1127 - 1138
  • [6] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [7] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599
  • [8] END-TO-END SPEECH SUMMARIZATION USING RESTRICTED SELF-ATTENTION
    Sharma, Roshan
    Palaskar, Shruti
    Black, Alan W.
    Metze, Florian
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8072 - 8076
  • [9] MULTICHANNEL AUDIO FRONT-END FOR FAR-FIELD AUTOMATIC SPEECH RECOGNITION
    Chhetri, Amit
    Hilmes, Philip
    Kristjansson, Trausti
    Chu, Wai
    Mansour, Mohamed
    Li, Xiaoxue
    Zhang, Xianxian
    2018 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2018, : 1527 - 1531
  • [10] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21):