Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited by: 0
Authors
Girdhar, Rohit [1]
Singh, Mannat [1]
Brown, Andrew [1]
Duval, Quentin [1]
Azadi, Samaneh [1]
Rambhatla, Sai Saketh [1]
Shah, Akbar [1]
Yin, Xi [1]
Parikh, Devi [1]
Misra, Ishan [1]
Affiliations
[1] Meta, GenAI, New York, NY 10003 USA
DOI
10.1007/978-3-031-73033-7_12
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions, namely adjusted noise schedules for diffusion and multi-stage training, that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% of the time over prior work.
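The factorization described in the abstract can be read as a two-stage sampling procedure: a text-to-image model produces a single frame, which then becomes an explicit condition for a text-and-image-to-video model. Below is a minimal sketch of that control flow under assumed interfaces; the function name, model objects, and their sample(...) signatures are hypothetical illustrations, not Emu Video's actual code or API.

```python
# Minimal sketch of the factorized two-step sampling described in the abstract.
# The model objects and their .sample(...) signatures are assumptions made for
# illustration; they are not the paper's actual implementation.

def generate_video(prompt, text_to_image_model, image_to_video_model, num_frames=16):
    """Factorized text-to-video generation.

    Step 1: sample an image conditioned only on the text prompt.
    Step 2: sample a video conditioned on both the text prompt and the
            generated image, which acts as an explicit content condition
            for the video diffusion model.
    """
    # Step 1: text -> image with a standard text-conditioned diffusion model.
    image = text_to_image_model.sample(prompt)

    # Step 2: (text, image) -> video. The image is passed as an explicit
    # conditioning signal alongside the text, so the video model mainly has
    # to add plausible motion rather than invent the content from scratch.
    video = image_to_video_model.sample(prompt, image_cond=image, num_frames=num_frames)
    return video
```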
Pages: 205-224
Page count: 20
Related Papers
50 entries in total
  • [1] Grid Diffusion Models for Text-to-Video Generation
    Lee, Taegyeong
    Kwon, Soyeong
    Kim, Taehwan
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 8734 - 8743
  • [2] Learning Text-to-Video Retrieval from Image Captioning
    Ventura, Lucas
    Schmid, Cordelia
    Varol, Gul
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, : 1834 - 1854
  • [3] ImproveYourVideos: Architectural Improvements for Text-to-Video Generation Pipeline
    Arkhipkin, Vladimir
    Shaheen, Zein
    Vasilev, Viacheslav
    Dakhova, Elizaveta
    Sobolev, Konstantin
    Kuznetsov, Andrey
    Dimitrov, Denis
    IEEE ACCESS, 2025, 13 : 1986 - 2003
  • [4] MEVG: Multi-event Video Generation with Text-to-Video Models
    Oh, Gyeongrok
    Jeong, Jaehwan
    Kim, Sieun
    Byeon, Wonmin
    Kim, Jinkyu
    Kim, Sungwoong
    Kim, Sangpil
    COMPUTER VISION-ECCV 2024, PT XLIII, 2025, 15101 : 401 - 418
  • [5] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    Wu, Jay Zhangjie
    Ge, Yixiao
    Wang, Xintao
    Lei, Stan Weixian
    Gu, Yuchao
    Shi, Yufei
    Hsu, Wynne
    Shan, Ying
    Qie, Xiaohu
    Shou, Mike Zheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7589 - 7599
  • [6] Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
    Wang, Wenjing
    Yang, Huan
    Tuo, Zixi
    He, Huiguo
    Zhu, Junchen
    Fu, Jianlong
    Liu, Jiaying
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025,
  • [7] Text-to-video Generation: Research Status, Progress and Challenges
    Deng Z.
    He X.
    Peng Y.
    Dianzi Yu Xinxi Xuebao/Journal of Electronics and Information Technology, 2024, 46 (05): 1632 - 1644
  • [8] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
    Wang, Xiang
    Zhang, Shiwei
    Yuan, Hangjie
    Qing, Zhiwu
    Gong, Biao
    Zhang, Yingya
    Shen, Yujun
    Gao, Changxin
    Sang, Nong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 6572 - 6582
  • [9] Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis
    Balaji, Yogesh
    Min, Martin Renqiang
    Bai, Bing
    Chellappa, Rama
    Graf, Hans Peter
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 1995 - 2001
  • [10] ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions
    Zhang, Yipeng
    Wang, Xin
    Chen, Hong
    Qin, Chenyang
    Hao, Yibo
    Mei, Hong
    Zhu, Wenwu
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025,