Music Conditioned Generation for Human-Centric Video

Cited by: 0

Authors:

Zhao, Zimeng [1]

Zuo, Binghui [1]

Wang, Yangang [1]

Affiliations:

[1] Southeast Univ, Sch Automat, Key Lab Measurement & Control Complex Syst Engn, Minist Educ, Nanjing 210096, Peoples R China

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Funding:

National Natural Science Foundation of China;

Keywords:

Multiple signal classification; Generative adversarial networks; Correlation; Visualization; Training; Task analysis; Feature extraction; Video generation; signal processing; cross-modal learning; human-centric;

DOI:

10.1109/LSP.2024.3358978

Chinese Library Classification (CLC) codes:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808; 0809;

Abstract:

Music and human-centric video are two fundamental signals across languages. Correlation analysis between the two is currently used in choreography and film accompaniment. This letter explores this correlation in a new task: human-centric video generation from a start-end image pair and transitional music. Existing human-centric generation methods are not competent for this task because they require frame-wise pose as input or have difficulty handling long-duration videos. Our key idea is to build a temporal generation framework dominated by DDPM and assisted by VAE and GAN. It reduces the computational cost of music-image diffusion by utilizing the latent space compactness of VAE and the image translation efficiency of GAN. To produce videos with both long duration and high quality, our framework first generates small-scale keyframes and then generates high-resolution videos. To strengthen the frame-wise consistency of the human body, a frame-aligned correspondence map is adopted as an intermediate supervision. Extensive experiments compared with the SOTA method have demonstrated the rationality and effectiveness of this signal generation framework.
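As a rough illustration of why diffusing in a compact VAE latent space is cheaper than diffusing over pixels, the sketch below applies the standard DDPM forward (noising) step to a small latent vector. It is not the authors' implementation: the linear beta schedule, the 1000-step horizon, the 4-dimensional latent, and all function names are assumptions chosen for the example, and the music conditioning, VAE, and GAN components are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [signal * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical 4-dimensional latent standing in for one VAE-encoded keyframe.
z0 = [0.5, -1.0, 0.25, 2.0]
zt, eps = noise_latent(z0, 500, alpha_bars, rng)
```

The cost of each diffusion step scales with the size of the vector being noised and denoised, so running it on a short latent code rather than on a full-resolution frame is where a latent-space design would save computation.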


Pages: 506-510

Page count: 5
