LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild

Cited by: 3
Authors
Chen, Zhipeng [1]
Wang, Xinheng [1]
Xie, Lun [1]
Yuan, Haijie [2]
Pan, Hang [3]
Affiliations
[1] Univ Sci & Technol Beijing, Beijing 100083, Peoples R China
[2] Xiaoduo Intelligent Technol Beijing Co Ltd, Beijing 100094, Peoples R China
[3] Changzhi Univ, Dept Comp Sci, Changzhi 046011, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Audio-driven generation; Lip synthesis; LPIPS loss; Multimodal fusion; Talking head generation;
DOI
10.1016/j.specom.2023.103028
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Researchers have shown growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker from audio. We use a U-Net architecture built on residual CBAM to better encode and fuse audio and visual modal information. In addition, a semantic alignment module extends the receptive field of the generator network to capture the spatial and channel information of the visual features efficiently, and matches the statistical information of the visual features with the audio latent vector, thereby adjusting and injecting the audio content information into the visual features. To achieve precise lip synchronization and generate realistic, high-quality images, our approach adopts an LPIPS loss, which approximates human judgment of image quality and reduces the likelihood of instability during training. The proposed method achieves outstanding lip synchronization accuracy and visual quality, as demonstrated by subjective and objective evaluation results.
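The abstract's two key ingredients, a semantic alignment step that matches visual feature statistics to the audio latent vector and an LPIPS reconstruction objective, can be illustrated with a short sketch. The code below is a minimal illustration under assumed names and shapes, not the paper's implementation: AudioAdaIN, reconstruction_loss, the audio_dim/visual_channels arguments, and the VGG backbone for LPIPS are all assumptions; only the lpips package itself is the public implementation of the metric.

```python
# Minimal sketch (assumptions noted above): an AdaIN-style alignment step that
# injects an audio latent vector into the statistics of visual features, plus
# an LPIPS perceptual loss on generated face crops.
import torch
import torch.nn as nn
import lpips  # pip install lpips


class AudioAdaIN(nn.Module):
    """Modulate visual feature statistics with an audio latent vector (hypothetical module)."""

    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Predict a per-channel scale and shift from the audio embedding.
        self.to_scale_shift = nn.Linear(audio_dim, 2 * visual_channels)

    def forward(self, visual_feat: torch.Tensor, audio_latent: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W); audio_latent: (B, audio_dim)
        b, c, _, _ = visual_feat.shape
        # Instance-normalize the visual features over the spatial dimensions.
        mean = visual_feat.mean(dim=(2, 3), keepdim=True)
        std = visual_feat.std(dim=(2, 3), keepdim=True) + 1e-5
        normalized = (visual_feat - mean) / std
        # Replace the statistics with audio-conditioned scale and shift,
        # i.e. inject audio content information into the visual features.
        scale, shift = self.to_scale_shift(audio_latent).view(b, 2 * c, 1, 1).chunk(2, dim=1)
        return normalized * (1 + scale) + shift


# LPIPS expects images in [-1, 1]; the VGG backbone here is an assumption.
lpips_fn = lpips.LPIPS(net="vgg")


def reconstruction_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Perceptual distance averaged over the batch; in practice this term would
    # be combined with L1 / GAN / lip-sync losses in the full objective.
    return lpips_fn(generated, target).mean()
```

In a U-Net generator of the kind the abstract describes, such a modulation step could be applied at one or more decoder scales, with the LPIPS term added alongside the usual reconstruction and synchronization losses.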
Pages: 8