Grid Diffusion Models for Text-to-Video Generation

Citations: 1

Authors:

Lee, Taegyeong [1]

Kwon, Soyeong [1]

Kim, Taehwan [1]

Affiliations:

[1] UNIST, Artificial Intelligence Grad Sch, Ulsan, South Korea

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

10.1109/CVPR52733.2024.00834

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is more challenging than generating images from text, because it requires a much larger dataset and higher computational cost. Most existing video generation methods use either a 3D U-Net architecture that models the temporal dimension or autoregressive generation. These methods require large datasets and incur higher computational costs than text-to-image generation. To tackle these challenges, we propose a simple but effective grid diffusion method for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. By representing the video as a grid image, we can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces the dimensions of the video to those of an image, various image-based methods can be applied to videos, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.
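The central idea in the abstract, representing a video as a single grid image so that an image model can process it, can be illustrated with a minimal sketch. The abstract does not specify the grid layout, so the row-major ordering, the grayscale nested-list frame format, and the helper names `frames_to_grid` and `grid_to_frames` below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# Assumption: one frame is H rows of W pixel values (grayscale, for brevity).
Frame = List[List[int]]


def frames_to_grid(frames: List[Frame], cols: int) -> Frame:
    """Tile equally sized frames, in row-major temporal order, into one grid image."""
    if not frames or len(frames) % cols != 0:
        raise ValueError("frame count must be a positive multiple of cols")
    height = len(frames[0])
    grid: Frame = []
    for start in range(0, len(frames), cols):
        row_of_frames = frames[start:start + cols]
        for y in range(height):
            # Concatenate pixel row y of every frame in this grid row.
            grid.append([px for frame in row_of_frames for px in frame[y]])
    return grid


def grid_to_frames(grid: Frame, rows: int, cols: int) -> List[Frame]:
    """Inverse of frames_to_grid: cut the grid back into frames in temporal order."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    return [
        [line[c * width:(c + 1) * width] for line in grid[r * height:(r + 1) * height]]
        for r in range(rows)
        for c in range(cols)
    ]


# Four 2x2 frames, each filled with its own index, tiled into a 2x2 grid.
frames = [[[i, i], [i, i]] for i in range(4)]
grid = frames_to_grid(frames, cols=2)
restored = grid_to_frames(grid, rows=2, cols=2)
```

Because the grid is an ordinary two-dimensional image, any image-level operation can in principle be applied to it and the result split back into frames, which is the property the abstract relies on for text-guided video manipulation.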

Citation

Pages: 8734-8743

Page count: 10

References: 47 in total
