Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020, 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for many sequence-to-sequence transformation tasks. This architecture disperses the attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques like Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into consideration by making use of cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
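To make the Gaussian-masking idea concrete, below is a minimal PyTorch sketch of cross attention whose weights are biased toward a local window of encoder frames. The abstract does not specify how the mask center and width are parameterized, so this sketch assumes the per-token alignment centers are supplied externally (e.g., derived from a previous layer's attention) and treats the Gaussian width sigma as a fixed hyperparameter; the function name gaussian_masked_cross_attention is illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(q, k, v, centers, sigma=10.0):
    """Cross attention with a Gaussian mask that biases each output
    token toward a local window of encoder (speech) frames.

    q:       (B, T_out, d)  decoder queries
    k, v:    (B, T_in, d)   encoder keys / values
    centers: (B, T_out)     assumed per-token alignment position (in frames)
    sigma:   Gaussian width hyperparameter controlling how local
             the attended speech context is
    """
    d = q.size(-1)
    # Standard scaled dot-product attention logits: (B, T_out, T_in)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # Gaussian bias centered at each token's alignment position.
    positions = torch.arange(k.size(1), device=k.device).float()      # (T_in,)
    dist2 = (positions[None, None, :] - centers[..., None]) ** 2       # (B, T_out, T_in)
    gaussian_bias = -dist2 / (2.0 * sigma ** 2)

    # Adding the bias in the log domain before softmax is equivalent to
    # multiplying the attention weights by a Gaussian window and renormalizing.
    weights = F.softmax(logits + gaussian_bias, dim=-1)
    return torch.matmul(weights, v), weights

# Usage sketch: 5 output tokens attending over 100 speech frames, with
# centers spread linearly as a stand-in for real alignment estimates.
B, T_out, T_in, d = 2, 5, 100, 64
q, k, v = torch.randn(B, T_out, d), torch.randn(B, T_in, d), torch.randn(B, T_in, d)
centers = torch.linspace(0, T_in - 1, T_out).repeat(B, 1)
out, w = gaussian_masked_cross_attention(q, k, v, centers)
```

Because the bias is added before the softmax rather than hard-masking frames, distant frames are strongly down-weighted but never assigned exactly zero probability, which keeps the operation differentiable in sigma and the centers.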
Pages: 5031-5035
Page count: 5
Related Papers
50 items in total
  • [41] Interactive CNN and Transformer-Based Cross-Attention Fusion Network for Medical Image Classification
    Cai, Shu
    Zhang, Qiude
    Wang, Shanshan
    Hu, Junjie
    Zeng, Liang
    Li, Kaiyan
    INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 2025, 35 (03)
  • [42] U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement
    Li, Yi
    Sun, Yang
    Wang, Wenwu
    Naqvi, Syed Mohsen
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1511 - 1521
  • [43] An analysis of local monotonic attention variants
    Merboldt, Andre
    Zeyer, Albert
    Schlueter, Ralf
    Ney, Hermann
    INTERSPEECH 2019, 2019, : 1398 - 1402
  • [44] On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
    Zhang, Shucong
    Loweimi, Erfan
    Bell, Peter
    Renals, Steve
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 89 - 96
  • [45] Towards non-monotonic sentence alignment
    Quan, Xiaojun
    Kit, Chunyu
    INFORMATION SCIENCES, 2015, 323 : 34 - 47
  • [46] Beyond Universal Transformer: Block Reusing with Adaptor in Transformer for Automatic Speech Recognition
    Tang, Haoyu
    Liu, Zhaoyi
    Zeng, Chang
    Li, Xinfeng
    ADVANCES IN NEURAL NETWORKS-ISNN 2024, 2024, 14827 : 69 - 79
  • [47] Bidirectional feature fusion via cross-attention transformer for chrysanthemum classification
    Chen, Yifan
    Yang, Xichen
    Yan, Hui
    Liu, Jia
    Jiang, Jian
    Mao, Zhongyuan
    Wang, Tianshu
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02)
  • [48] Universal Speech Transformer
    Zhao, Yingzhu
    Ni, Chongjia
    Leung, Cheung-Chi
    Joty, Shafiq
    Chng, Eng Siong
    Ma, Bin
    INTERSPEECH 2020, 2020, : 5021 - 5025
  • [49] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation
    Petit, Olivier
    Thome, Nicolas
    Rambour, Clement
    Themyr, Loic
    Collins, Toby
    Soler, Luc
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2021, 2021, 12966 : 267 - 276
  • [50] Bayesian Transformer Language Models for Speech Recognition
    Xue, Boyang
    Yu, Jianwei
    Xu, Junhao
    Liu, Shansong
    Hu, Shoukang
    Ye, Zi
    Geng, Mengzhe
    Liu, Xunying
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7378 - 7382