DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Times Cited: 2
Authors
Hong, Fa-Ting [1 ]
Shen, Li [2 ]
Xu, Dan [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Alibaba Grp, Hangzhou 310052, Peoples R China
Keywords
Faces; Head; Three-dimensional displays; Geometry; Magnetic heads; Estimation; Annotations; Talking head generation; self-supervised facial depth estimation; geometry-guided video generation; IMAGE;
DOI
10.1109/TPAMI.2023.3339964
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Predominant techniques for talking head generation largely depend on 2D information, including facial appearance and motion extracted from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noise during generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this paper, first, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations for training. We further propose a strategy to learn pixel-level uncertainties so as to identify more reliable rigid-motion pixels for geometry learning. Second, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied at each generation layer to capture facial geometry in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that the proposed framework generates highly realistic reenacted talking-head videos, establishing new state-of-the-art performance on these benchmarks.
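To make the 3D-aware cross-modal attention idea concrete, below is a minimal PyTorch sketch of an attention block in which a depth encoding queries appearance features at a single generation layer. The module name, channel sizes, residual wiring, and usage example are illustrative assumptions for exposition only, not the authors' implementation (consult the paper via the DOI above for the actual design).

import torch
import torch.nn as nn

class DepthAwareCrossModalAttention(nn.Module):
    # Sketch only: queries come from a depth encoding; keys and values come
    # from appearance features at the same spatial resolution.
    def __init__(self, feat_channels, depth_channels, key_dim=64):
        super().__init__()
        self.to_q = nn.Conv2d(depth_channels, key_dim, kernel_size=1)
        self.to_k = nn.Conv2d(feat_channels, key_dim, kernel_size=1)
        self.to_v = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)

    def forward(self, appearance, depth_feat):
        b, c, h, w = appearance.shape
        q = self.to_q(depth_feat).flatten(2).transpose(1, 2)        # (B, HW, key_dim)
        k = self.to_k(appearance).flatten(2)                        # (B, key_dim, HW)
        v = self.to_v(appearance).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return appearance + out  # residual keeps the original appearance content

# Hypothetical usage on a coarse 32x32 feature map.
if __name__ == "__main__":
    block = DepthAwareCrossModalAttention(feat_channels=128, depth_channels=32)
    feat = torch.randn(1, 128, 32, 32)
    depth = torch.randn(1, 32, 32, 32)
    print(block(feat, depth).shape)  # torch.Size([1, 128, 32, 32])

Applying such a block at several decoder resolutions would match the coarse-to-fine behavior described in the abstract; full-resolution spatial attention becomes memory-heavy, so coarser layers are the natural fit for this plain formulation.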
Pages: 2997 - 3012
Number of Pages: 16