LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Cited by: 7
Authors
Qu, Leyuan [1 ]
Weber, Cornelius [1 ]
Wermter, Stefan [1 ]
Affiliations
[1] University of Hamburg, Department of Informatics, Knowledge Technology Institute, D-22527 Hamburg, Germany
Keywords
Lips; Speech recognition; Visualization; Videos; Image reconstruction; Face recognition; Vocabulary; Lip reading; self-supervised pre-training; speech recognition; speech reconstruction
DOI
10.1109/TNNLS.2022.3191677
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Code(s)
081104; 0812; 0835; 1405
Abstract
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ~2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify transferability. Finally, we train a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
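For illustration, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: an encoder over per-frame visual features, a location-aware (location-sensitive) attention module, and an autoregressive decoder that emits mel-spectrogram frames. All layer choices, dimensions, and names (e.g., LipToMel, frame_dim=2048) are assumptions made for this sketch and are not taken from the paper.

```python
# Minimal sketch of a video-to-spectrogram encoder-decoder with
# location-sensitive attention. Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Attention whose energies also depend on the previous alignment (location features)."""
    def __init__(self, enc_dim, dec_dim, attn_dim, loc_channels=32, loc_kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, loc_channels, loc_kernel, padding=loc_kernel // 2, bias=False)
        self.loc_proj = nn.Linear(loc_channels, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_align):
        # query: (B, dec_dim); keys: (B, T, enc_dim); prev_align: (B, T)
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)     # (B, T, loc_channels)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.key_proj(keys) + self.loc_proj(loc)
        )).squeeze(-1)                                                    # (B, T)
        align = F.softmax(energies, dim=-1)
        context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)         # (B, enc_dim)
        return context, align


class LipToMel(nn.Module):
    """Encode a lip-frame feature sequence, then autoregressively decode mel frames."""
    def __init__(self, frame_dim=2048, enc_dim=256, dec_dim=256, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, enc_dim // 2, bidirectional=True, batch_first=True)
        self.attention = LocationAwareAttention(enc_dim, dec_dim, attn_dim=128)
        self.decoder_cell = nn.GRUCell(n_mels + enc_dim, dec_dim)
        self.mel_out = nn.Linear(dec_dim + enc_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, frames, n_dec_steps):
        # frames: (B, T_video, frame_dim) -- per-frame visual features of the mouth region
        keys, _ = self.encoder(frames)                                    # (B, T, enc_dim)
        B, T, _ = keys.shape
        state = keys.new_zeros(B, self.decoder_cell.hidden_size)
        prev_mel = keys.new_zeros(B, self.n_mels)
        align = keys.new_zeros(B, T)
        mels = []
        for _ in range(n_dec_steps):
            context, align = self.attention(state, keys, align)
            state = self.decoder_cell(torch.cat([prev_mel, context], dim=-1), state)
            prev_mel = self.mel_out(torch.cat([state, context], dim=-1))
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)                                   # (B, n_dec_steps, n_mels)


if __name__ == "__main__":
    model = LipToMel()
    video_feats = torch.randn(2, 75, 2048)       # 2 clips, 75 frames of visual features each
    mel = model(video_feats, n_dec_steps=300)    # predict 300 mel-spectrogram frames
    print(mel.shape)                             # torch.Size([2, 300, 80])
```

The location features (the previous alignment convolved over time) let the attention favor positions near where it attended last, which is the usual motivation for location-sensitive attention when decoding spectrograms from a roughly monotonic input sequence.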
Pages: 2772-2782
Page count: 11