Text-to-video Generation: Research Status, Progress and Challenges

Times Cited: 0
Authors
Deng, Zijun [1 ]
He, Xiangteng [1 ]
Peng, Yuxin [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-video generation; Diffusion model; Generative Adversarial Network (GAN);
DOI
10.11999/JEIT240074
CLC Classification Code
TM [Electrical Technology]; TN [Electronic and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Text-to-video generation aims to produce semantically consistent, photo-realistic, temporally consistent, and logically coherent videos from textual descriptions. This paper first surveys the current state of research in text-to-video generation, providing a detailed overview of the three mainstream approaches: methods based on recurrent networks and Generative Adversarial Networks (GANs), methods based on Transformers, and methods based on diffusion models. Each approach has its strengths and weaknesses. Methods based on recurrent networks and GANs can generate videos of higher resolution and longer duration, but struggle to generate complex open-domain videos. Transformer-based methods are adept at generating open-domain videos, but suffer from unidirectional bias and error accumulation, which make it difficult to produce high-fidelity videos. Diffusion models generalize well, but their slow inference and high memory consumption make it challenging to generate high-definition, long videos. The paper then reviews the evaluation benchmarks and metrics used in text-to-video generation and compares the performance of existing methods. Finally, potential future research directions in the field are outlined.
Pages: 1632-1644
Number of Pages: 13
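
The abstract's point that diffusion models are constrained by inference speed follows directly from their iterative sampling procedure. Below is a minimal, self-contained sketch (NumPy only, hypothetical names throughout) of the DDPM-style ancestral sampling loop with classifier-free guidance that many text-to-video diffusion methods build on; it is an illustration under stated assumptions, not the surveyed paper's method. `predict_noise` is a placeholder for a learned spatio-temporal denoiser conditioned on a text encoder's output; the loop makes two model calls per step for all T steps, which is where the inference cost comes from.

import numpy as np

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)

def predict_noise(x, t, text_emb):
    # Placeholder for a learned denoiser eps_theta(x_t, t, text).
    # Returns random noise here so the loop runs end to end.
    return rng.standard_normal(x.shape)

def sample(text_emb, shape=(8, 3, 16, 16), guidance=7.5):
    # Ancestral sampling with classifier-free guidance.
    # shape = (frames, channels, height, width): a toy "video" tensor.
    x = rng.standard_normal(shape)       # start from pure Gaussian noise
    null_emb = np.zeros_like(text_emb)   # "empty prompt" embedding
    for t in reversed(range(T)):
        eps_cond = predict_noise(x, t, text_emb)
        eps_uncond = predict_noise(x, t, null_emb)
        # Mix conditional and unconditional estimates (classifier-free guidance).
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        # DDPM posterior mean for x_{t-1}.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                        # add noise at every step but the last
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

video = sample(text_emb=np.ones(512))
print(video.shape)                       # (8, 3, 16, 16)

Because every one of the T steps requires full forward passes over all frames, reducing step count or operating in a compressed latent space are the usual levers for the speed and memory limits the abstract describes.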