Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020, 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for many sequence-to-sequence transformation tasks. This architecture disperses the attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques like Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into consideration by making use of cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
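To make the Gaussian-masking idea concrete, below is a minimal PyTorch sketch of cross attention whose weights are biased toward a local window of encoder frames. The abstract does not specify how the mask center and width are parameterized, so this sketch assumes the per-token alignment centers are supplied externally (e.g., derived from a previous layer's attention) and treats the Gaussian width sigma as a fixed hyperparameter; the function name gaussian_masked_cross_attention is illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(q, k, v, centers, sigma=10.0):
    """Cross attention with a Gaussian mask that biases each output
    token toward a local window of encoder (speech) frames.

    q:       (B, T_out, d)  decoder queries
    k, v:    (B, T_in, d)   encoder keys / values
    centers: (B, T_out)     assumed per-token alignment position (in frames)
    sigma:   Gaussian width hyperparameter controlling how local
             the attended speech context is
    """
    d = q.size(-1)
    # Standard scaled dot-product attention logits: (B, T_out, T_in)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # Gaussian bias centered at each token's alignment position.
    positions = torch.arange(k.size(1), device=k.device).float()      # (T_in,)
    dist2 = (positions[None, None, :] - centers[..., None]) ** 2       # (B, T_out, T_in)
    gaussian_bias = -dist2 / (2.0 * sigma ** 2)

    # Adding the bias in the log domain before softmax is equivalent to
    # multiplying the attention weights by a Gaussian window and renormalizing.
    weights = F.softmax(logits + gaussian_bias, dim=-1)
    return torch.matmul(weights, v), weights

# Usage sketch: 5 output tokens attending over 100 speech frames, with
# centers spread linearly as a stand-in for real alignment estimates.
B, T_out, T_in, d = 2, 5, 100, 64
q, k, v = torch.randn(B, T_out, d), torch.randn(B, T_in, d), torch.randn(B, T_in, d)
centers = torch.linspace(0, T_in - 1, T_out).repeat(B, 1)
out, w = gaussian_masked_cross_attention(q, k, v, centers)
```

Because the bias is added before the softmax rather than hard-masking frames, distant frames are strongly down-weighted but never assigned exactly zero probability, which keeps the operation differentiable in sigma and the centers.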
Pages: 5031-5035
Page count: 5
Related Papers
50 items in total
  • [41] Interactive CNN and Transformer-Based Cross-Attention Fusion Network for Medical Image Classification
    Cai, Shu
    Zhang, Qiude
    Wang, Shanshan
    Hu, Junjie
    Zeng, Liang
    Li, Kaiyan
    INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 2025, 35 (03)
  • [42] U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement
    Li, Yi
    Sun, Yang
    Wang, Wenwu
    Naqvi, Syed Mohsen
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1511 - 1521
  • [43] An analysis of local monotonic attention variants
    Merboldt, Andre
    Zeyer, Albert
    Schlueter, Ralf
    Ney, Hermann
    INTERSPEECH 2019, 2019, : 1398 - 1402
  • [44] On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
    Zhang, Shucong
    Loweimi, Erfan
    Bell, Peter
    Renals, Steve
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 89 - 96
  • [45] Towards non-monotonic sentence alignment
    Quan, Xiaojun
    Kit, Chunyu
    INFORMATION SCIENCES, 2015, 323 : 34 - 47
  • [46] Beyond Universal Transformer: Block Reusing with Adaptor in Transformer for Automatic Speech Recognition
    Tang, Haoyu
    Liu, Zhaoyi
    Zeng, Chang
    Li, Xinfeng
    ADVANCES IN NEURAL NETWORKS-ISNN 2024, 2024, 14827 : 69 - 79
  • [47] Bidirectional feature fusion via cross-attention transformer for chrysanthemum classification
    Chen, Yifan
    Yang, Xichen
    Yan, Hui
    Liu, Jia
    Jiang, Jian
    Mao, Zhongyuan
    Wang, Tianshu
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02)
  • [48] Universal Speech Transformer
    Zhao, Yingzhu
    Ni, Chongjia
    Leung, Cheung-Chi
    Joty, Shafiq
    Chng, Eng Siong
    Ma, Bin
    INTERSPEECH 2020, 2020, : 5021 - 5025
  • [49] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation
    Petit, Olivier
    Thome, Nicolas
    Rambour, Clement
    Themyr, Loic
    Collins, Toby
    Soler, Luc
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2021, 2021, 12966 : 267 - 276
  • [50] Bayesian Transformer Language Models for Speech Recognition
    Xue, Boyang
    Yu, Jianwei
    Xu, Junhao
    Liu, Shansong
    Hu, Shoukang
    Ye, Zi
    Geng, Mengzhe
    Liu, Xunying
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7378 - 7382