Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

Cited by: 8
Authors
Gong, Rong [1]
Quillen, Carl [2]
Sharma, Dushyant [2]
Goderre, Andrew [2]
Lainez, Jose [3]
Milanovic, Ljubomir [1]
Affiliations
[1] Nuance Commun GmbH, Vienna, Austria
[2] Nuance Commun Inc, Burlington, MA USA
[3] Nuance Commun SA, Madrid, Spain
Source
INTERSPEECH 2021 | 2021
Keywords
speech recognition; multichannel; self-attention; ASR frontend; channel combination; far-field; end-to-end;
DOI
10.21437/Interspeech.2021-1190
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
When a sufficiently large amount of far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% word error rate reduction (WERR) compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
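The abstract describes combining multichannel signals in the magnitude spectral domain with self-attention. The following is a minimal NumPy sketch of that general idea, not the authors' architecture: the function name `combine_channels`, the projection matrices `w_q`, `w_k`, `w_v`, the tensor layout (frames, channels, frequency bins), and the use of a softmax across channels to produce combination weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def combine_channels(mag, w_q, w_k, w_v):
    """Hypothetical attention-based channel combination (illustrative only).

    mag : (T, C, F) magnitude spectra for T frames, C channels, F bins.
    w_q, w_k : (F, D) query/key projections (assumed, randomly set below).
    w_v : (F, 1) value projection yielding one scalar per channel.
    Returns the combined (T, F) spectrum and the (T, C) channel weights.
    """
    q = mag @ w_q                                   # (T, C, D)
    k = mag @ w_k                                   # (T, C, D)
    v = mag @ w_v                                   # (T, C, 1)
    # Attention is computed across channels, independently for each frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, C, C)
    attn = softmax(scores, axis=-1)
    context = attn @ v                              # (T, C, 1)
    # Normalize across channels so the weights of each frame sum to 1.
    weights = softmax(context[..., 0], axis=-1)     # (T, C)
    # Weighted sum of the channel magnitude spectra.
    combined = (weights[..., None] * mag).sum(axis=1)  # (T, F)
    return combined, weights


# Usage with random data: 4 frames, 3 channels, 5 frequency bins.
rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((4, 3, 5)))
w_q = rng.standard_normal((5, 8))
w_k = rng.standard_normal((5, 8))
w_v = rng.standard_normal((5, 1))
combined, weights = combine_channels(mag, w_q, w_k, w_v)
```

In this sketch the per-frame weights are non-negative and sum to one, so each output bin is a convex combination of the corresponding bins of the input channels. In a jointly optimized system the projections would be learned together with the ASR backend rather than drawn at random.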
Pages: 3840-3844
Number of pages: 5