Music Conditioned Generation for Human-Centric Video

Times Cited: 0
Authors
Zhao, Zimeng [1 ]
Zuo, Binghui [1 ]
Wang, Yangang [1 ]
Affiliations
[1] Southeast Univ, Sch Automat, Key Lab Measurement & Control Complex Syst Engn, Minist Educ, Nanjing 210096, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multiple signal classification; Generative adversarial networks; Correlation; Visualization; Training; Task analysis; Feature extraction; Video generation; signal processing; cross-modal learning; human-centric;
DOI
10.1109/LSP.2024.3358978
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
Music and human-centric video are two fundamental signals across languages. Correlation analysis between the two is currently used in choreography and film accompaniment. This letter explores this correlation in a new task: human-centric video generation from a start-end image pair and transitional music. Existing human-centric generation methods are ill-suited to this task: they either require frame-wise pose as input or struggle with long-duration videos. Our key idea is to build a temporal generation framework dominated by a denoising diffusion probabilistic model (DDPM) and assisted by a variational autoencoder (VAE) and a generative adversarial network (GAN). It reduces the computational cost of music-image diffusion by exploiting the latent-space compactness of the VAE and the image-translation efficiency of the GAN. To produce videos with both long duration and high quality, our framework first generates small-scale keyframes and then generates high-resolution videos. To strengthen the frame-wise consistency of the human body, a frame-aligned correspondence map is adopted as intermediate supervision. Extensive experiments and comparisons against state-of-the-art methods demonstrate the rationality and effectiveness of this signal generation framework.
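The two-stage pipeline the abstract describes (compress frames into a compact VAE-style latent space, denoise music-conditioned keyframe latents between the start and end frames, then upscale) can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the latent dimension, the noise-predictor, and all function names (`encode_latent`, `ddpm_denoise_step`, `generate_keyframes`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_latent(frame):
    """Stand-in for a VAE encoder: compress a frame into a compact latent."""
    return frame.reshape(-1)[:16] * 0.1  # hypothetical 16-dim latent

def ddpm_denoise_step(z_t, t, music_feat):
    """One toy reverse-diffusion step conditioned on a music feature vector."""
    predicted_noise = 0.5 * z_t + 0.1 * music_feat  # placeholder noise predictor
    return z_t - (t / 10.0) * predicted_noise

def generate_keyframes(z_start, z_end, music_feats, steps=10):
    """Stage 1: denoise noisy interpolations between the start/end latents,
    one keyframe latent per transitional music segment."""
    keyframes = []
    for k, m in enumerate(music_feats):
        alpha = (k + 1) / (len(music_feats) + 1)
        z = (1 - alpha) * z_start + alpha * z_end + rng.normal(0, 1, z_start.shape)
        for t in range(steps, 0, -1):
            z = ddpm_denoise_step(z, t, m)
        keyframes.append(z)
    return np.stack(keyframes)

# usage: a start-end image pair plus four transitional music segments
start_frame = rng.normal(size=(8, 8))
end_frame = rng.normal(size=(8, 8))
z0, z1 = encode_latent(start_frame), encode_latent(end_frame)
music = rng.normal(size=(4, 16))  # hypothetical per-segment music embeddings
keyframes = generate_keyframes(z0, z1, music)
print(keyframes.shape)  # (4, 16): one latent keyframe per music segment
```

In the actual framework, stage 2 would then translate each keyframe latent into a high-resolution frame (the GAN-assisted step), guided by the frame-aligned correspondence map; that step is omitted here.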
Pages: 506-510 (5 pages)