A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Cited by: 2
Authors
Wang, Xiang [1,2]
Zhang, Shiwei [2 ]
Yuan, Hangjie [3 ]
Qing, Zhiwu [1 ]
Gong, Biao [2 ]
Zhang, Yingya [2 ]
Shen, Yujun [4 ]
Gao, Changxin [1 ]
Sang, Nong [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Key Lab Image Proc & Intelligent Control, Wuhan, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Ant Grp, Hangzhou, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52733.2024.00628
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One key reason is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), given the high cost of video captioning. By contrast, it is far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we propose a novel text-to-video generation framework, termed TF-T2V, which can learn directly from text-free videos. The rationale is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights. Following this pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and observe an encouraging performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model enjoys a sustained performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our method on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available.
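The two-branch design in the abstract can be pictured with a short sketch. The toy PyTorch code below is a minimal illustration, assuming a denoiser with separate spatial and temporal layers plus a text projection; every module name, tensor shape, and the single-step noise corruption are hypothetical placeholders rather than the authors' released implementation.

# Minimal, illustrative sketch of the two-branch idea in the abstract: a
# content branch learns text-conditioned appearance from captioned data, a
# motion branch learns temporal dynamics from text-free clips, and both
# branches share one set of denoiser weights. All names, shapes, and the toy
# corruption below are hypothetical stand-ins, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDenoiser(nn.Module):
    """Toy stand-in for a spatio-temporal diffusion denoiser with shared weights."""
    def __init__(self, channels=8, text_dim=16):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)   # appearance
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1)  # motion
        self.text_proj = nn.Linear(text_dim, channels)               # text decoding

    def forward(self, x, text_emb=None, use_temporal=True):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        out = self.spatial(x.flatten(0, 1)).view(b, f, c, h, w)
        if text_emb is not None:   # content-branch path: condition on text
            out = out + self.text_proj(text_emb).view(b, 1, c, 1, 1)
        if use_temporal:           # motion-branch path: model frame dynamics
            t = out.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
            out = out + self.temporal(t).view(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return out

def denoising_loss(model, clean, text_emb=None, use_temporal=True):
    # Simplified one-step corruption; a real diffusion model uses a full
    # timestep schedule.
    noise = torch.randn_like(clean)
    pred = model(clean + noise, text_emb=text_emb, use_temporal=use_temporal)
    return F.mse_loss(pred, noise)

model = SharedDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical toy batches standing in for the two data streams.
captioned_frames = torch.randn(2, 1, 8, 16, 16)  # captioned single frames
captions = torch.randn(2, 16)                    # placeholder text embeddings
textfree_clips = torch.randn(2, 4, 8, 16, 16)    # unlabeled 4-frame clips

for step in range(3):
    # Content branch: learn text conditioning without temporal layers.
    loss_content = denoising_loss(model, captioned_frames,
                                  text_emb=captions, use_temporal=False)
    # Motion branch: learn temporal modeling from text-free videos.
    loss_motion = denoising_loss(model, textfree_clips, use_temporal=True)
    (loss_content + loss_motion).backward()      # joint optimization, shared weights
    opt.step()
    opt.zero_grad()
    print(f"step {step}: content={loss_content.item():.3f} "
          f"motion={loss_motion.item():.3f}")

Because both branches backpropagate into the same parameters, adding text-free clips scales up the motion branch while captioned data alone maintains text controllability, which is the property the abstract's scaling experiments exploit.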
Pages: 6572-6582
Page count: 11