MusicFace: Music-driven expressive singing face synthesis

Times Cited: 5
Authors
Liu, Pengfei [1 ]
Deng, Wenjin [1 ]
Li, Hengda [1 ]
Wang, Jintai [1 ]
Zheng, Yinglin [1 ]
Ding, Yiwei [1 ]
Guo, Xiaohu [2 ]
Zeng, Ming [1 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen 361000, Peoples R China
[2] Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75080 USA
Source
COMPUTATIONAL VISUAL MEDIA | 2024, Vol. 10, No. 1
Funding
US National Science Foundation; National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
face synthesis; singing; music; generative adversarial network; VIRTUAL HEAD; VIDEO; TEXT;
DOI
10.1007/s41095-023-0343-7
Chinese Library Classification
TP31 [Computer software];
Discipline codes
081202; 0835;
Abstract
It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task that produces natural motions for the lips, facial expression, head pose, and eyes. Because common music audio signals mix information from the human voice and the backing music, we design a decouple-and-fuse strategy to tackle this challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Since the correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme in which the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation into speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling each separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state of the art.
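The attention scheme described in the abstract, where the separated vocal and backing-music streams are fused before driving the face, can be illustrated with a minimal cross-attention sketch. This is not the paper's architecture; the feature dimensions, the residual fusion, and the function names are all illustrative assumptions, with the vocal stream querying the backing-music stream so each vocal frame pulls in the musical context most relevant to it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(vocal, music):
    """Cross-attention fusion of two audio feature streams.

    vocal, music: arrays of shape (time, dim), e.g. per-frame features
    from a source-separation front end. The vocal stream acts as the
    query; the backing-music stream supplies keys and values.
    """
    d = vocal.shape[-1]
    scores = vocal @ music.T / np.sqrt(d)   # (T_vocal, T_music) affinities
    weights = softmax(scores, axis=-1)      # attention over music frames
    context = weights @ music               # music context per vocal frame
    return vocal + context                  # residual fusion of the streams

# Usage: 50 frames of 16-dim features per stream (sizes are arbitrary).
rng = np.random.default_rng(0)
vocal = rng.standard_normal((50, 16))
music = rng.standard_normal((50, 16))
fused = fuse_streams(vocal, music)
print(fused.shape)  # (50, 16): one fused feature vector per vocal frame
```

In a full model along these lines, the fused features would then feed separate decoders for expression, head motion, and eye state, matching the paper's decomposition of those outputs.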
Pages: 119-136
Page count: 18