Implicit Memory-Based Variational Motion Talking Face Generation

Cited by: 2

Authors:

Yang, Daowu [1]

Huang, Sheng [1]

Jiang, Wen [1]

Zou, Jin [1]

Affiliations:

[1] Hunan International Economics University, Changsha 410205, People's Republic of China

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Implicit memory; speech-driven facial; audio-to-motion;

DOI:

10.1109/LSP.2024.3356415

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808; 0809;

Abstract:

Speech-driven facial animation is a challenging problem where each input audio can have multiple plausible facial outputs, leading to overly smooth results. Although the two-stage framework of audio-to-motion model and neural rendering models can partially mitigate this issue, it lacks crucial details like emotions and wrinkles. To overcome these limitations, we introduce a variational motion generator with implicit memory. By incorporating implicit memory into the audio-to-motion model, we capture high-level semantics in the shared latent space of audio expressions, resulting in accurate and expressive facial landmark generation. Next, we introduce attention with time bias to effectively maintain the consistency of audio motion and adopt a periodic position encoding strategy to provide summarization capability for longer audio sequences. Experimental results demonstrate that our approach outperforms previous methods, yielding more extensive and realistic speech-driven facial animation.


Pages: 431-435

Page count: 5

