Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; HIDDEN MARKOV-MODELS;
DOI
10.21437/Interspeech.2020-1198
CLC Classification
R36 (Pathology); R76 (Otorhinolaryngology);
Discipline Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for different sequence-to-sequence transformation tasks. This model architecture disperses the attention distribution over the entire input to learn long-term dependencies, which is important for some sequence-to-sequence tasks, such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
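The abstract describes biasing the decoder's cross attention with a Gaussian window centered on an estimated alignment position. Below is a minimal single-head PyTorch sketch of that idea; the function name, the way the alignment center `center` is supplied, and the fixed width `sigma` are illustrative assumptions, not the paper's exact formulation (the alignment regularizer mentioned in the abstract is not shown).

```python
# Minimal sketch of Gaussian-masked cross attention (assumed PyTorch).
# `center` and `sigma` are hypothetical parameters: in practice the
# alignment center per output token would come from an alignment model
# or from the attention weights themselves, as the paper suggests.
import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(query, key, value, center, sigma=10.0):
    """Cross attention whose logits are biased by a Gaussian window.

    query:  (batch, tgt_len, d_model)  decoder states
    key:    (batch, src_len, d_model)  encoded speech representations
    value:  (batch, src_len, d_model)
    center: (batch, tgt_len)           estimated source position per token
    """
    d_k = query.size(-1)
    # Standard scaled dot-product attention logits.
    logits = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5

    # Gaussian bias: penalize source positions far from the alignment
    # center, confining each token's attention to a local speech context.
    src_pos = torch.arange(key.size(1), device=key.device).float()
    bias = -((src_pos.view(1, 1, -1) - center.unsqueeze(-1)) ** 2) / (2 * sigma ** 2)

    weights = F.softmax(logits + bias, dim=-1)
    return torch.matmul(weights, value), weights
```

Because the mask is added to the logits before the softmax rather than applied as a hard cutoff, distant frames are strongly down-weighted but never made strictly unreachable, which keeps the model differentiable end to end.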
Pages: 5031-5035
Page count: 5
Related Papers (50 in total)
  • [31] Zhu, Hongming; Wang, Zeju; Han, Letong; Xu, Manxin; Li, Weiqi; Liu, Qin; Liu, Sicong; Du, Bowen. TSMCF: Transformer-Based SAR and Multispectral Cross-Attention Fusion for Cloud Removal. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2025, 18: 6710-6720.
  • [32] Yang, Jinghui; Li, Anqi; Qian, Jinxi; Qin, Jia; Wang, Liguo. A Cross-Attention-Based Multi-Information Fusion Transformer for Hyperspectral Image Classification. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17: 13358-13375.
  • [33] Kang, S.; Kim, S.; Seo, K. Transformer-based Cross attention and Feature Diversity for Occluded Person Re-identification. Transactions of the Korean Institute of Electrical Engineers, 2023, 72(01): 108-113.
  • [34] Che, Na; Zhu, Yiming; Wang, Haiyan; Zeng, Xianwei; Du, Qinsheng. AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition. APPLIED SCIENCES-BASEL, 2025, 15(01).
  • [35] Yin, Hua; Chen, Qitong; Chen, Liang; Shen, Changqing. Cross-Attention Transformer-Based Domain Adaptation: A Novel Method for Fault Diagnosis of Rotating Machinery With High Generalizability and Alignment Capability. IEEE SENSORS JOURNAL, 2024, 24(23): 40049-40058.
  • [36] Zhou, Zhuozhi; Lan, Jinhui. A Dual Cross Attention Transformer Network for Infrared and Visible Image Fusion. 2024 7TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA, ICAIBD 2024, 2024: 494-499.
  • [37] Zhou, Yuan; Huo, Chunlei; Zhu, Jiahang; Huo, Leigang; Pan, Chunhong. DCAT: Dual Cross-Attention-Based Transformer for Change Detection. REMOTE SENSING, 2023, 15(09).
  • [38] Wang, Peng; Guo, Zhiyuan; Xie, Fei. Layer Sparse Transformer for Speech Recognition. 2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023: 269-273.
  • [39] Sun, Yajun; Wang, Meng; Ma, Ying. Semantic-alignment transformer and adversary hashing for cross-modal retrieval. APPLIED INTELLIGENCE, 2024, 54(17-18): 7581-7602.
  • [40] Li, Zirui; Liu, Runbang; Sun, Le; Zheng, Yuhui. Multi-Feature Cross Attention-Induced Transformer Network for Hyperspectral and LiDAR Data Classification. REMOTE SENSING, 2024, 16(15).