Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Cited by: 3

Authors:

He, Shan [1,2]

He, Haonan [2]

Yang, Shuo [2]

Wu, Xiaoyan [2]

Xia, Pengcheng [2]

Yin, Bing [2]

Liu, Cong [2]

Dai, Lirong [1]

Xu, Chang [3]

Affiliations:

[1] University of Science and Technology of China, Hefei, People's Republic of China

[2] iFLYTEK Research, Hefei, People's Republic of China

[3] University of Sydney, Camperdown, NSW, Australia

Source:

2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023) | 2023

DOI:

10.1109/ICCV51070.2023.01305

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent audio2mesh-based methods have shown promise for speech-driven 3D facial animation. However, several hard challenges remain open. For example, data scarcity is intrinsic to the task because 4D data is difficult to collect. In addition, current methods generally offer little control over the animated face. To this end, we propose a novel framework named Speech4Mesh that first generates 4D talking-head data and then trains the audio2mesh network on the reconstructed meshes. In our framework, we first reconstruct the 4D talking-head sequence from monocular videos. To capture talking-related facial variation precisely, we exploit the audio-visual alignment information in the video through a contrastive learning scheme. We then train the audio2mesh network (e.g., FaceFormer) on the generated 4D data. To control the animated talking face, we encode speaking-unrelated factors (e.g., emotion) into an emotion embedding for manipulation. Finally, a differentiable renderer ensures more accurate photometric details in the reconstruction and animation results. Experiments demonstrate that the Speech4Mesh framework not only outperforms state-of-the-art reconstruction methods, especially on the lower face, but also achieves better animation performance, both perceptually and objectively, after being pre-trained on the synthesized data. We also verify that the proposed framework can explicitly control the emotion of the animated talking face.

Pages: 14146-14156

Page count: 11

References: 76 in total
