END-TO-END AUDIO-VISUAL SPEECH RECOGNITION WITH CONFORMERS

Cited by: 105
Authors
Ma, Pingchuan [1 ]
Petridis, Stavros [1 ]
Pantic, Maja [1 ]
Affiliations
[1] Imperial College London, Department of Computing, London, England
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
audio-visual speech recognition; end-to-end training; convolution-augmented transformer;
DOI
10.1109/ICASSP39728.2021.9414567
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and raw pixels, respectively; these features are fed to Conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training (instead of the pre-computed visual features common in the literature), the use of a Conformer (instead of a recurrent network), and the use of a Transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
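The fusion step described above (concatenating frame-aligned audio and visual encoder outputs and passing them through an MLP) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `mlp_fuse`, the two-layer ReLU MLP, and all dimensions are illustrative choices, and the encoder outputs are stood in for by random arrays.

```python
import numpy as np

def mlp_fuse(audio_feats, visual_feats, w1, b1, w2, b2):
    """Fuse frame-aligned audio and visual features with a 2-layer MLP.

    audio_feats, visual_feats: (T, D) arrays of per-frame encoder outputs.
    Returns a (T, D_out) fused representation for the decoder/CTC heads.
    """
    x = np.concatenate([audio_feats, visual_feats], axis=-1)  # (T, 2*D)
    h = np.maximum(0.0, x @ w1 + b1)                          # ReLU hidden layer
    return h @ w2 + b2                                        # fused features

# Illustrative shapes: T=4 frames, D=8 per modality, hidden=16, output=8.
rng = np.random.default_rng(0)
T, D, H, D_out = 4, 8, 16, 8
fused = mlp_fuse(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                 rng.normal(size=(2 * D, H)), np.zeros(H),
                 rng.normal(size=(H, D_out)), np.zeros(D_out))
print(fused.shape)  # (4, 8)
```

The key design point the abstract relies on is that both streams are already synchronised to a common frame rate before fusion, so concatenation along the feature axis is well defined at every time step.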
Pages: 7613 - 7617 (5 pages)