DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Times Cited: 2
Authors
Hong, Fa-Ting [1 ]
Shen, Li [2 ]
Xu, Dan [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Alibaba Grp, Hangzhou 310052, Peoples R China
Keywords
Faces; Head; Three-dimensional displays; Geometry; Magnetic heads; Estimation; Annotations; Talking head generation; self-supervised facial depth estimation; geometry-guided video generation; IMAGE;
DOI
10.1109/TPAMI.2023.3339964
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Predominant techniques for talking head generation largely depend on 2D information, including facial appearance and motion extracted from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noise during generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this paper, first, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations for training. We further propose a strategy to learn pixel-level uncertainties so as to identify more reliable rigid-motion pixels for geometry learning. Second, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied at each generation layer to capture facial geometry in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that the proposed framework generates highly realistic reenacted talking-head videos, establishing new state-of-the-art performance on these benchmarks.
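To make the 3D-aware cross-modal attention idea concrete, below is a minimal PyTorch sketch of an attention block in which a depth encoding queries appearance features at a single generation layer. The module name, channel sizes, residual wiring, and usage example are illustrative assumptions for exposition only, not the authors' implementation (consult the paper via the DOI above for the actual design).

import torch
import torch.nn as nn

class DepthAwareCrossModalAttention(nn.Module):
    # Sketch only: queries come from a depth encoding; keys and values come
    # from appearance features at the same spatial resolution.
    def __init__(self, feat_channels, depth_channels, key_dim=64):
        super().__init__()
        self.to_q = nn.Conv2d(depth_channels, key_dim, kernel_size=1)
        self.to_k = nn.Conv2d(feat_channels, key_dim, kernel_size=1)
        self.to_v = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)

    def forward(self, appearance, depth_feat):
        b, c, h, w = appearance.shape
        q = self.to_q(depth_feat).flatten(2).transpose(1, 2)        # (B, HW, key_dim)
        k = self.to_k(appearance).flatten(2)                        # (B, key_dim, HW)
        v = self.to_v(appearance).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return appearance + out  # residual keeps the original appearance content

# Hypothetical usage on a coarse 32x32 feature map.
if __name__ == "__main__":
    block = DepthAwareCrossModalAttention(feat_channels=128, depth_channels=32)
    feat = torch.randn(1, 128, 32, 32)
    depth = torch.randn(1, 32, 32, 32)
    print(block(feat, depth).shape)  # torch.Size([1, 128, 32, 32])

Applying such a block at several decoder resolutions would match the coarse-to-fine behavior described in the abstract; full-resolution spatial attention becomes memory-heavy, so coarser layers are the natural fit for this plain formulation.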
Pages: 2997 - 3012
Number of Pages: 16