Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Cited by: 0
Authors
Yariv, Guy [1 ,2 ]
Gat, Itai [3 ]
Benaim, Sagie [1 ]
Wolf, Lior [4 ]
Schwartz, Idan [2 ,4 ]
Adi, Yossi [1 ]
Affiliations
[1] Hebrew Univ Jerusalem, Jerusalem, Israel
[2] NetApp, San Jose, CA 95128 USA
[3] Technion, Haifa, Israel
[4] Tel Aviv Univ, Tel Aviv, Israel
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38, NO 7 | 2024
Keywords
None listed
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned with the input audio both globally and temporally: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of the video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder. The proposed method is based on a lightweight adaptor network that learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, on audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets that exhibit significant semantic diversity of audio-video samples, and we further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. Compared with recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method exhibit higher visual quality and greater diversity. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/.
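The abstract describes two technical components that a short sketch can make concrete. First, the adaptor network maps features from a pre-trained audio encoder into the conditioning sequence a frozen text-to-video model expects. The sketch below is a minimal, hypothetical reading of that idea in PyTorch; the class name, layer choices, and dimensions are placeholder assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AudioToTextTokenAdaptor(nn.Module):
    # Hypothetical lightweight adaptor: projects pre-trained audio-encoder
    # features into a sequence of pseudo text tokens for a frozen
    # text-to-video model. All dimensions are placeholders.
    def __init__(self, audio_dim=768, text_dim=1024, n_tokens=77):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.n_tokens = n_tokens

    def forward(self, audio_feats):
        # audio_feats: (batch, audio_frames, audio_dim) from the audio encoder.
        x = self.proj(audio_feats)  # (B, F, text_dim)
        # Resample the temporal axis to the token count the T2V model expects,
        # preserving the audio's temporal ordering in the conditioning tokens.
        x = torch.nn.functional.interpolate(
            x.transpose(1, 2), size=self.n_tokens, mode="linear"
        ).transpose(1, 2)  # (B, n_tokens, text_dim)
        return x

Second, AV-Align scores temporal alignment by detecting and comparing energy peaks in the audio and the video. The following sketch assumes librosa onsets as the audio peaks, mean frame-difference motion energy as the video peaks, and an IoU-style match within a tolerance window; these concrete choices (including the tol parameter) are our assumptions, not the paper's exact definition.

import numpy as np
import librosa

def audio_peak_times(wav, sr, hop=512):
    # Onset-strength envelope, then standard onset picking; returns seconds.
    env = librosa.onset.onset_strength(y=wav, sr=sr, hop_length=hop)
    frames = librosa.onset.onset_detect(onset_envelope=env, sr=sr, hop_length=hop)
    return librosa.frames_to_time(frames, sr=sr, hop_length=hop)

def video_peak_times(frames, fps):
    # Motion energy as mean absolute frame difference; local maxima as peaks.
    frames = frames.astype(np.float32)  # (T, H, W, C)
    energy = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    peaks = [t for t in range(1, len(energy) - 1)
             if energy[t] > energy[t - 1] and energy[t] > energy[t + 1]]
    return np.asarray(peaks, dtype=np.float32) / fps  # seconds

def av_align(audio_ts, video_ts, tol=0.1):
    # IoU-style score: fraction of peaks, across both modalities, that have
    # a counterpart in the other modality within +/- tol seconds.
    def matched(a, b):
        return sum(bool(np.any(np.abs(b - t) <= tol)) for t in a)
    union = len(audio_ts) + len(video_ts)
    if union == 0:
        return 0.0
    return (matched(audio_ts, video_ts) + matched(video_ts, audio_ts)) / union

A higher score indicates that audio energy peaks and visual motion peaks co-occur, which is the temporal-alignment property the metric is meant to capture.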
Pages: 6639-6647
Page count: 9