Music Conditioned Generation for Human-Centric Video

Times Cited: 0
Authors
Zhao, Zimeng [1 ]
Zuo, Binghui [1 ]
Wang, Yangang [1 ]
Affiliations
[1] Southeast Univ, Sch Automat, Key Lab Measurement & Control Complex Syst Engn, Minist Educ, Nanjing 210096, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multiple signal classification; Generative adversarial networks; Correlation; Visualization; Training; Task analysis; Feature extraction; Video generation; signal processing; cross-modal learning; human-centric;
DOI
10.1109/LSP.2024.3358978
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
Music and human-centric video are two fundamental signals across languages. Correlation analysis between the two is currently used in choreography and film accompaniment. This letter explores this correlation in a new task: human-centric video generation from a start-end image pair and transitional music. Existing human-centric generation methods are ill-suited to this task: they either require frame-wise pose as input or struggle with long-duration videos. Our key idea is to build a temporal generation framework dominated by a denoising diffusion probabilistic model (DDPM) and assisted by a variational autoencoder (VAE) and a generative adversarial network (GAN). It reduces the computational cost of music-image diffusion by exploiting the latent-space compactness of the VAE and the image-translation efficiency of the GAN. To produce videos with both long duration and high quality, our framework first generates small-scale keyframes and then generates high-resolution videos. To strengthen the frame-wise consistency of the human body, a frame-aligned correspondence map is adopted as intermediate supervision. Extensive experiments and comparisons against state-of-the-art methods demonstrate the rationality and effectiveness of this signal generation framework.
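The two-stage pipeline the abstract describes (compress frames into a compact VAE-style latent space, denoise music-conditioned keyframe latents between the start and end frames, then upscale) can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the latent dimension, the noise-predictor, and all function names (`encode_latent`, `ddpm_denoise_step`, `generate_keyframes`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_latent(frame):
    """Stand-in for a VAE encoder: compress a frame into a compact latent."""
    return frame.reshape(-1)[:16] * 0.1  # hypothetical 16-dim latent

def ddpm_denoise_step(z_t, t, music_feat):
    """One toy reverse-diffusion step conditioned on a music feature vector."""
    predicted_noise = 0.5 * z_t + 0.1 * music_feat  # placeholder noise predictor
    return z_t - (t / 10.0) * predicted_noise

def generate_keyframes(z_start, z_end, music_feats, steps=10):
    """Stage 1: denoise noisy interpolations between the start/end latents,
    one keyframe latent per transitional music segment."""
    keyframes = []
    for k, m in enumerate(music_feats):
        alpha = (k + 1) / (len(music_feats) + 1)
        z = (1 - alpha) * z_start + alpha * z_end + rng.normal(0, 1, z_start.shape)
        for t in range(steps, 0, -1):
            z = ddpm_denoise_step(z, t, m)
        keyframes.append(z)
    return np.stack(keyframes)

# usage: a start-end image pair plus four transitional music segments
start_frame = rng.normal(size=(8, 8))
end_frame = rng.normal(size=(8, 8))
z0, z1 = encode_latent(start_frame), encode_latent(end_frame)
music = rng.normal(size=(4, 16))  # hypothetical per-segment music embeddings
keyframes = generate_keyframes(z0, z1, music)
print(keyframes.shape)  # (4, 16): one latent keyframe per music segment
```

In the actual framework, stage 2 would then translate each keyframe latent into a high-resolution frame (the GAN-assisted step), guided by the frame-aligned correspondence map; that step is omitted here.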
Pages: 506-510 (5 pages)