Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

Cited by: 8
Authors
Gong, Rong [1 ]
Quillen, Carl [2 ]
Sharma, Dushyant [2 ]
Goderre, Andrew [2 ]
Lainez, Jose [3 ]
Milanovic, Ljubomir [1 ]
Affiliations
[1] Nuance Commun GmbH, Vienna, Austria
[2] Nuance Commun Inc, Burlington, MA USA
[3] Nuance Commun SA, Madrid, Spain
Source
INTERSPEECH 2021, 2021
Keywords
speech recognition; multichannel; self-attention; ASR frontend; channel combination; far-field; end-to-end
DOI
10.21437/Interspeech.2021-1190
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
When sufficiently large far-field training data are available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
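To make the channel-combination idea in the abstract concrete, the following is a minimal PyTorch sketch of a self-attention combinator operating on per-channel magnitude spectra. It is an illustrative assumption, not the published SACC architecture: the class name, embedding size, log compression, the averaging over the query-channel axis, and the softmax over channels are all choices made here for demonstration only.

```python
# Hypothetical sketch of a self-attention channel combinator-style frontend.
# NOT the authors' exact SACC design: projection sizes, log compression, and
# the softmax over channels are assumptions made for illustration.
import torch
import torch.nn as nn


class SelfAttentionChannelCombinator(nn.Module):
    """Combines M microphone channels of magnitude spectra into one channel."""

    def __init__(self, num_freq_bins: int, embed_dim: int = 128):
        super().__init__()
        # Per-channel projections of the (log-)magnitude spectrum to query/key space.
        self.query = nn.Linear(num_freq_bins, embed_dim)
        self.key = nn.Linear(num_freq_bins, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, channels, frames, freq) magnitude spectrogram per channel.
        x = torch.log1p(mag)                     # assumed compression, for illustration
        q = self.query(x)                        # (B, M, T, D)
        k = self.key(x)                          # (B, M, T, D)
        # Channel-to-channel attention scores, computed independently per frame.
        scores = torch.einsum('bmtd,bntd->btmn', q, k) * self.scale   # (B, T, M, M)
        # Reduce over the query-channel axis and softmax over channels, giving one
        # combination weight per channel and frame (an assumed reduction).
        weights = torch.softmax(scores.mean(dim=2), dim=-1)           # (B, T, M)
        # Weighted sum of channel magnitudes -> single-channel spectrogram.
        return torch.einsum('btm,bmtf->btf', weights, mag)            # (B, T, F)


if __name__ == "__main__":
    B, M, T, F = 2, 4, 100, 257                  # e.g. a 4-mic array, 257 STFT bins
    mags = torch.rand(B, M, T, F)
    frontend = SelfAttentionChannelCombinator(num_freq_bins=F)
    print(frontend(mags).shape)                  # torch.Size([2, 100, 257])
```

The combined single-channel spectrogram would then feed the ASR backend (a ContextNet-based model in the paper), so the combination weights can be learned jointly with the recognizer.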
Pages: 3840-3844
Number of pages: 5
Related Papers
50 records in total
  • [1] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399
  • [2] Curriculum Learning based approaches for robust end-to-end far-field speech recognition
    Ranjan, Shivesh
    Hansen, John H. L.
    SPEECH COMMUNICATION, 2021, 132 : 123 - 131
  • [3] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [4] SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
    Gao, Zhifu
    Zhang, Shiliang
    Lei, Ming
    McLoughlin, Ian
    INTERSPEECH 2020, 2020, : 6 - 10
  • [5] Efficient decoding self-attention for end-to-end speech synthesis
    Zhao, Wei
    Xu, Li
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07) : 1127 - 1138
  • [6] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [7] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599
  • [8] END-TO-END SPEECH SUMMARIZATION USING RESTRICTED SELF-ATTENTION
    Sharma, Roshan
    Palaskar, Shruti
    Black, Alan W.
    Metze, Florian
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8072 - 8076
  • [9] MULTICHANNEL AUDIO FRONT-END FOR FAR-FIELD AUTOMATIC SPEECH RECOGNITION
    Chhetri, Amit
    Hilmes, Philip
    Kristjansson, Trausti
    Chu, Wai
    Mansour, Mohamed
    Li, Xiaoxue
    Zhang, Xianxian
    2018 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2018, : 1527 - 1531
  • [10] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21):