Grid Diffusion Models for Text-to-Video Generation

Cited by: 1

Authors:

Lee, Taegyeong [1]

Kwon, Soyeong [1]

Kim, Taehwan [1]

Affiliations:

[1] UNIST, Artificial Intelligence Graduate School, Ulsan, South Korea

Source:

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 | 2024

Funding:

National Research Foundation, Singapore

Keywords:

DOI:

10.1109/CVPR52733.2024.00834

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is more challenging than generating images from text, because it requires much larger datasets and a higher computational cost. Most existing video generation methods use either a 3D U-Net architecture that models the temporal dimension or autoregressive generation. These methods require large datasets and are far more computationally expensive than text-to-image generation. To tackle these challenges, we propose a simple but effective grid diffusion approach for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. By representing the video as a grid image, we can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces the dimensions of the video to those of an image, various image-based methods can be applied to videos, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.
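The idea of representing a video as a grid image can be illustrated with a small sketch. The abstract does not specify the grid layout, so the row-major tiling below, the function names `frames_to_grid` and `grid_to_frames`, and the use of NumPy are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (N, H, W, C) into one (rows*H, cols*W, C) image.

    Assumed layout: row-major, so frame i lands at grid cell
    (i // cols, i % cols). The paper's actual layout may differ.
    """
    n, h, w, c = frames.shape
    if n != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {n}")
    # (N, H, W, C) -> (rows, cols, H, W, C) -> (rows, H, cols, W, C)
    tiled = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return tiled.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Invert frames_to_grid: split a grid image back into (N, H, W, C)."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size is not divisible by rows/cols")
    h, w = gh // rows, gw // cols
    # (rows*H, cols*W, C) -> (rows, H, cols, W, C) -> (rows, cols, H, W, C)
    cells = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return cells.reshape(rows * cols, h, w, c)


if __name__ == "__main__":
    # Four dummy 64x64 RGB frames packed into a 2x2 grid image.
    video = np.random.rand(4, 64, 64, 3).astype(np.float32)
    grid = frames_to_grid(video, rows=2, cols=2)
    restored = grid_to_frames(grid, rows=2, cols=2)
    print(grid.shape, np.array_equal(video, restored))
```

Because the grid is an ordinary image array, any image-only model or editing routine can operate on it directly, and the frames are recovered afterwards by the inverse split.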


Pages: 8734-8743

Page count: 10
