Grid Diffusion Models for Text-to-Video Generation

Cited by: 1
Authors
Lee, Taegyeong [1]
Kwon, Soyeong [1]
Kim, Taehwan [1]
Affiliations
[1] UNIST, Artificial Intelligence Grad Sch, Ulsan, South Korea
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024
Funding
National Research Foundation of Singapore
Keywords
DOI
10.1109/CVPR52733.2024.00834
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in the diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and are limited in terms of computational costs compared to text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation without temporal dimension in architecture and a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally, since our method reduces the dimensions of the video to the dimensions of the image, various image-based methods can be applied to videos, such as text-guided video manipulation from image manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.
Pages: 8734-8743
Page count: 10