Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
|
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models;
DOI
10.21437/Interspeech.2020-1198
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for various sequence-to-sequence transformation tasks. The architecture disperses its attention distribution over the entire input to learn long-term dependencies, which is important for some sequence-to-sequence tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), the RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to restrict the input speech context to a local range around the given alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
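The Gaussian masking of cross-attention weights described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the fixed window width `sigma`, and the externally supplied per-token alignment centers are all assumptions made here for illustration (in the paper the alignment information is obtained from the cross-attention weights themselves, and an additional alignment regularizer is used).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_masked_cross_attention(queries, keys, values, centers, sigma=3.0):
    """Scaled dot-product cross attention whose weights are reweighted by a
    Gaussian window centered on a per-output-token alignment position.

    queries: (T_out, d) decoder states; keys/values: (T_in, d) encoder frames;
    centers: (T_out,) assumed alignment position of each output token in the
    input frame axis (hypothetical input for this sketch).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (T_out, T_in)
    weights = softmax(scores, axis=-1)                # standard cross attention
    # Gaussian mask: for output step i, keep mass near frame centers[i]
    pos = np.arange(keys.shape[0])[None, :]           # (1, T_in)
    mask = np.exp(-((pos - centers[:, None]) ** 2) / (2.0 * sigma ** 2))
    masked = weights * mask
    masked /= masked.sum(axis=-1, keepdims=True)      # renormalize to sum to 1
    return masked @ values, masked
```

With a small `sigma` the masked weights concentrate around each token's alignment center, mimicking the local speech context that CTC- and RNN-T-style models exploit, while the underlying attention scores are left unchanged.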
Pages: 5031-5035
Page count: 5
Related Papers
50 records in total
  • [21] A Cross Attention Transformer-Mixed Feedback Video Recommendation Algorithm Based on DIEN
    Zhang, Jianwei
    Zhao, Zhishang
    Cai, Zengyu
    Feng, Yuan
    Zhu, Liang
    Sun, Yahui
    CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (01): : 977 - 996
  • [22] Cross attention is all you need: relational remote sensing change detection with transformer
    Lu, Kaixuan
    Huang, Xiao
    Xia, Ruiheng
    Zhang, Pan
    Shen, Junping
    GISCIENCE & REMOTE SENSING, 2024, 61 (01)
  • [23] An efficient object tracking based on multi-head cross-attention transformer
    Dai, Jiahai
    Li, Huimin
    Jiang, Shan
    Yang, Hongwei
    EXPERT SYSTEMS, 2025, 42 (02)
  • [24] EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech
    Lee, Young-Eun
    Lee, Seo-Hyun
    10TH INTERNATIONAL WINTER CONFERENCE ON BRAIN-COMPUTER INTERFACE (BCI2022), 2022,
  • [25] Conformer: Convolution-augmented Transformer for Speech Recognition
    Gulati, Anmol
    Qin, James
    Chiu, Chung-Cheng
    Parmar, Niki
    Zhang, Yu
    Yu, Jiahui
    Han, Wei
    Wang, Shibo
    Zhang, Zhengdong
    Wu, Yonghui
    Pang, Ruoming
    INTERSPEECH 2020, 2020, : 5036 - 5040
  • [26] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [27] TRANSFORMER TRANSDUCER: A STREAMABLE SPEECH RECOGNITION MODEL WITH TRANSFORMER ENCODERS AND RNN-T LOSS
    Zhang, Qian
    Lu, Han
    Sak, Hasim
    Tripathi, Anshuman
    McDermott, Erik
    Koo, Stephen
    Kumar, Shankar
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2020), 2020, : 7829 - 7833
  • [28] WaveNet With Cross-Attention for Audiovisual Speech Recognition
    Wang, Hui
    Gao, Fei
    Zhao, Yue
    Wu, Licheng
    IEEE ACCESS, 2020, 8 : 169160 - 169168
  • [29] Few Shot Medical Image Segmentation with Cross Attention Transformer
    Lin, Yi
    Chen, Yufan
    Cheng, Kwang-Ting
    Chen, Hao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT II, 2023, 14221 : 233 - 243
  • [30] Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition
    Tanaka, Tomohiro
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Moriya, Takafumi
    Ashihara, Takanori
    Orihashi, Shota
    Makishima, Naoki
    INTERSPEECH 2021, 2021, : 4059 - 4063