LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Cited by: 7
Authors
Qu, Leyuan [1 ]
Weber, Cornelius [1 ]
Wermter, Stefan [1 ]
Affiliations
[1] University of Hamburg, Department of Informatics, Knowledge Technology Institute, D-22527 Hamburg, Germany
Keywords
Lips; Speech recognition; Visualization; Videos; Image reconstruction; Face recognition; Vocabulary; Lip reading; self-supervised pre-training; speech recognition; speech reconstruction
DOI
10.1109/TNNLS.2022.3191677
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Code(s)
081104; 0812; 0835; 1405
Abstract
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ~2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify transferability. Finally, we train a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
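For illustration, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: an encoder over per-frame visual features, a location-aware (location-sensitive) attention module, and an autoregressive decoder that emits mel-spectrogram frames. All layer choices, dimensions, and names (e.g., LipToMel, frame_dim=2048) are assumptions made for this sketch and are not taken from the paper.

```python
# Minimal sketch of a video-to-spectrogram encoder-decoder with
# location-sensitive attention. Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Attention whose energies also depend on the previous alignment (location features)."""
    def __init__(self, enc_dim, dec_dim, attn_dim, loc_channels=32, loc_kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, loc_channels, loc_kernel, padding=loc_kernel // 2, bias=False)
        self.loc_proj = nn.Linear(loc_channels, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_align):
        # query: (B, dec_dim); keys: (B, T, enc_dim); prev_align: (B, T)
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)     # (B, T, loc_channels)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.key_proj(keys) + self.loc_proj(loc)
        )).squeeze(-1)                                                    # (B, T)
        align = F.softmax(energies, dim=-1)
        context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)         # (B, enc_dim)
        return context, align


class LipToMel(nn.Module):
    """Encode a lip-frame feature sequence, then autoregressively decode mel frames."""
    def __init__(self, frame_dim=2048, enc_dim=256, dec_dim=256, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, enc_dim // 2, bidirectional=True, batch_first=True)
        self.attention = LocationAwareAttention(enc_dim, dec_dim, attn_dim=128)
        self.decoder_cell = nn.GRUCell(n_mels + enc_dim, dec_dim)
        self.mel_out = nn.Linear(dec_dim + enc_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, frames, n_dec_steps):
        # frames: (B, T_video, frame_dim) -- per-frame visual features of the mouth region
        keys, _ = self.encoder(frames)                                    # (B, T, enc_dim)
        B, T, _ = keys.shape
        state = keys.new_zeros(B, self.decoder_cell.hidden_size)
        prev_mel = keys.new_zeros(B, self.n_mels)
        align = keys.new_zeros(B, T)
        mels = []
        for _ in range(n_dec_steps):
            context, align = self.attention(state, keys, align)
            state = self.decoder_cell(torch.cat([prev_mel, context], dim=-1), state)
            prev_mel = self.mel_out(torch.cat([state, context], dim=-1))
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)                                   # (B, n_dec_steps, n_mels)


if __name__ == "__main__":
    model = LipToMel()
    video_feats = torch.randn(2, 75, 2048)       # 2 clips, 75 frames of visual features each
    mel = model(video_feats, n_dec_steps=300)    # predict 300 mel-spectrogram frames
    print(mel.shape)                             # torch.Size([2, 300, 80])
```

The location features (the previous alignment convolved over time) let the attention favor positions near where it attended last, which is the usual motivation for location-sensitive attention when decoding spectrograms from a roughly monotonic input sequence.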
Pages: 2772-2782
Page count: 11