WaveNet With Cross-Attention for Audiovisual Speech Recognition

Cited by: 7
Authors
Wang, Hui [1 ]
Gao, Fei [1 ]
Zhao, Yue [1 ]
Wu, Licheng [1 ]
Affiliations
[1] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech recognition; Visualization; Acoustics; Lips; Mathematical model; Feature extraction; Mouth; Cross-attention mechanism; multimodal speech recognition; WaveNet model; end-to-end-model;
DOI
10.1109/ACCESS.2020.3024218
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In this paper, WaveNet with cross-attention is proposed for Audio-Visual Automatic Speech Recognition (AV-ASR) to address the multimodal feature fusion and frame alignment problems between the two data streams. WaveNet is usually used for speech generation and speech recognition; in this paper, however, we extend it to audiovisual speech recognition, introducing the cross-attention mechanism at different positions in WaveNet for feature fusion. The proposed cross-attention mechanism identifies the visual feature frames correlated with each acoustic feature frame. The experimental results show that WaveNet with cross-attention reduces the Tibetan single-syllable error by about 4.5% and the English word error by about 39.8% relative to audio-only speech recognition, and reduces the Tibetan single-syllable error by about 35.1% and the English word error by about 21.6% relative to the conventional feature concatenation method for AV-ASR.
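The fusion idea in the abstract can be sketched as scaled dot-product cross-attention, where acoustic frames act as queries over visual (lip) frames as keys and values. This is a minimal NumPy sketch under assumed details: the paper does not specify here the exact attention formulation or where in WaveNet it is inserted, and the dimensions and frame counts below are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio, visual):
    """Acoustic frames attend over visual frames (assumed scaled dot-product form).

    audio:  (T_a, d) acoustic feature frames, used as queries
    visual: (T_v, d) visual feature frames, used as keys and values
    Returns (T_a, d): for each acoustic frame, a weighted sum of the
    visual frames most correlated with it, handling frame misalignment.
    """
    d = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d)  # (T_a, T_v) frame-pair similarities
    weights = softmax(scores, axis=-1)      # soft alignment over visual frames
    return weights @ visual                 # fused visual context per acoustic frame

rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 64))   # e.g. 100 acoustic frames
visual = rng.standard_normal((25, 64))   # e.g. 25 lip-image frames (lower frame rate)
fused = cross_attention(audio, visual)
print(fused.shape)  # (100, 64)
```

Because the attention weights sum to one over the visual axis for every acoustic frame, the two streams need not share a frame rate, which is how this style of fusion sidesteps hard frame alignment.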
Pages: 169160-169168
Page count: 9