A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

Cited: 7
Authors
Qin, Chu-Xiong [1]
Zhang, Wen-Lin [1]
Qu, Dan [1]
Affiliations
[1] National Digital Switching System Engineering & Technological R&D Center, Zhengzhou, People's Republic of China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Speech recognition; End-to-end; Attention mechanism
DOI
10.1186/s13636-019-0161-0
Source
EURASIP Journal on Audio, Speech, and Music Processing, 2019
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently attracted increasing attention and achieved impressive performance. This hybrid end-to-end architecture adds an auxiliary CTC loss to the attention-based model, imposing additional constraints on the alignments. To explore end-to-end models further, we propose improvements to both feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and computes attention scores over multi-level outputs. Experiments on TIMIT show that our best model achieves state-of-the-art performance. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) higher than that of the best referenced method, which is trained on a much larger dataset, while outperforming all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
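To make the hybrid objective concrete, here is a minimal PyTorch sketch (not the authors' code) of a joint CTC-attention loss: a CTC loss on the encoder branch is interpolated with a cross-entropy attention loss on the decoder branch. The class name JointCTCAttentionLoss, the weight ctc_weight, and the tensor shapes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class JointCTCAttentionLoss(nn.Module):
        # Hybrid objective: L = w * L_CTC + (1 - w) * L_attention.
        def __init__(self, blank: int = 0, ctc_weight: float = 0.3):
            super().__init__()
            self.ctc_weight = ctc_weight
            self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
            self.att = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padding

        def forward(self, ctc_log_probs, att_logits, targets,
                    input_lengths, target_lengths):
            # ctc_log_probs: (T, N, V) log-softmax outputs of the CTC branch
            # att_logits:    (N, U, V) decoder outputs, one step per target token
            # targets:       (N, U) label indices, padded with -1
            ctc_targets = targets.clamp(min=0)  # padding past target_lengths is ignored
            loss_ctc = self.ctc(ctc_log_probs, ctc_targets,
                                input_lengths, target_lengths)
            loss_att = self.att(att_logits.reshape(-1, att_logits.size(-1)),
                                targets.reshape(-1))
            return self.ctc_weight * loss_ctc + (1.0 - self.ctc_weight) * loss_att

The multi-level multi-head attention can likewise be read as one multi-head attention per encoder level, with the per-level context vectors then fused; the concatenate-and-project fusion below is our assumption, not necessarily the paper's exact scoring scheme.

    class MultiLevelMultiHeadAttention(nn.Module):
        # One multi-head attention per encoder level; contexts are fused by
        # concatenation followed by a linear projection (an assumed design).
        def __init__(self, d_model: int = 256, num_heads: int = 4,
                     num_levels: int = 2):
            super().__init__()
            self.attentions = nn.ModuleList(
                nn.MultiheadAttention(d_model, num_heads, batch_first=True)
                for _ in range(num_levels))
            self.fuse = nn.Linear(num_levels * d_model, d_model)

        def forward(self, query, level_outputs):
            # query:         (N, 1, d_model) current decoder state
            # level_outputs: list of (N, T, d_model) tensors, one per level
            contexts = [att(query, mem, mem)[0]  # (N, 1, d_model) per level
                        for att, mem in zip(self.attentions, level_outputs)]
            return self.fuse(torch.cat(contexts, dim=-1))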
Pages: 12