END-TO-END AUDIO-VISUAL SPEECH RECOGNITION WITH CONFORMERS

Cited by: 105
Authors
Ma, Pingchuan [1 ]
Petridis, Stavros [1 ]
Pantic, Maja [1 ]
Affiliations
[1] Imperial College London, Department of Computing, London, England
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
audio-visual speech recognition; end-to-end training; convolution-augmented transformer;
DOI
10.1109/ICASSP39728.2021.9414567
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and raw pixels, respectively; these features are fed to Conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training (instead of the pre-computed visual features common in the literature), the use of a Conformer (instead of a recurrent network), and the use of a Transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
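The fusion step described above (concatenating frame-aligned audio and visual encoder outputs and passing them through an MLP) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `mlp_fuse`, the two-layer ReLU MLP, and all dimensions are illustrative choices, and the encoder outputs are stood in for by random arrays.

```python
import numpy as np

def mlp_fuse(audio_feats, visual_feats, w1, b1, w2, b2):
    """Fuse frame-aligned audio and visual features with a 2-layer MLP.

    audio_feats, visual_feats: (T, D) arrays of per-frame encoder outputs.
    Returns a (T, D_out) fused representation for the decoder/CTC heads.
    """
    x = np.concatenate([audio_feats, visual_feats], axis=-1)  # (T, 2*D)
    h = np.maximum(0.0, x @ w1 + b1)                          # ReLU hidden layer
    return h @ w2 + b2                                        # fused features

# Illustrative shapes: T=4 frames, D=8 per modality, hidden=16, output=8.
rng = np.random.default_rng(0)
T, D, H, D_out = 4, 8, 16, 8
fused = mlp_fuse(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                 rng.normal(size=(2 * D, H)), np.zeros(H),
                 rng.normal(size=(H, D_out)), np.zeros(D_out))
print(fused.shape)  # (4, 8)
```

The key design point the abstract relies on is that both streams are already synchronised to a common frame rate before fusion, so concatenation along the feature axis is well defined at every time step.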
Pages: 7613 - 7617 (5 pages)