WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing

Times Cited: 0
Authors
Feng, Yutang [1 ,5 ]
Gao, Sicheng [1 ,3 ]
Bao, Yuxiang [1 ]
Wang, Xiaodi [2 ]
Han, Shumin [1 ,2 ]
Zhang, Juan [1 ]
Zhang, Baochang [1 ,4 ]
Yao, Angela [3 ]
Affiliations
[1] Beihang University, Beijing, China
[2] Baidu VIS, Beijing, China
[3] National University of Singapore, Singapore
[4] Zhongguancun Laboratory, Beijing, China
[5] Baidu, Beijing, China
Source
Computer Vision - ECCV 2024, Part LXXVI | 2025, Vol. 15134
Funding
National Research Foundation, Singapore; Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
Text-to-video editing; DDIM inversion; Flow-guided warping
DOI
10.1007/978-3-031-73116-7_3
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Text-driven video editing has emerged as a prominent application built on the breakthroughs of image diffusion models. Existing state-of-the-art methods focus on zero-shot frameworks due to limited training data and computing resources. To preserve structural consistency, previous frameworks usually employ Denoising Diffusion Implicit Model (DDIM) inversion to provide inverted noise latents as guidance. The key challenge lies in limiting the errors caused by randomness and inaccuracy at each step of naive DDIM inversion, which can lead to temporal inconsistency in video editing. Our observation indicates that incorporating temporal keyframe information can alleviate the error accumulated during inversion. In this paper, we propose an effective warping strategy in the feature domain to obtain high-quality DDIM-inverted noise latents. Specifically, we shuffle the editing frames randomly at each timestep and use optical flow extracted from the source video to propagate the latent features of the first keyframe to subsequent keyframes. Moreover, we develop a comprehensive zero-shot framework that accommodates this strategy in both the inversion and denoising processes, thereby facilitating the generation of consistent edited videos. We compare our method with state-of-the-art text-driven editing methods on various real-world videos containing different forms of motion. The project page is available at https://ree1s.github.io/wave/.
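The core idea of the abstract, propagating the first keyframe's latent to later keyframes via optical flow, lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration of flow-guided warping in the latent domain; the names `warp_latent` and `propagate_keyframes`, the backward-flow convention, and the blend weight `alpha` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of flow-guided latent warping (not the paper's code).
import torch
import torch.nn.functional as F

def warp_latent(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent map (1, C, H, W) with optical flow (1, 2, H, W).

    flow[:, 0] / flow[:, 1] are assumed to be x / y displacements in pixels
    at the latent resolution (e.g., a RAFT flow downsampled to latent size).
    """
    _, _, h, w = latent.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=latent.device, dtype=latent.dtype),
        torch.arange(w, device=latent.device, dtype=latent.dtype),
        indexing="ij",
    )
    # Displace the grid by the flow, then normalize to [-1, 1] for grid_sample.
    x_new = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    y_new = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x_new, y_new), dim=-1)  # (1, H, W, 2)
    return F.grid_sample(latent, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def propagate_keyframes(latents, flows, alpha=0.5):
    """Propagate the first keyframe's latent to subsequent keyframes.

    latents: list of (1, C, H, W) inverted latents at the current timestep.
    flows:   flows[i] is assumed to be the backward flow from keyframe i+1
             to keyframe 0 (the convention grid_sample requires).
    alpha:   blend weight between the warped anchor latent and each frame's
             own latent; a hypothetical knob, not a value from the paper.
    """
    out = [latents[0]]
    for lat, flow in zip(latents[1:], flows):
        warped = warp_latent(out[0], flow)
        out.append(alpha * warped + (1 - alpha) * lat)
    return out
```

In this reading, the warp injects the anchor keyframe's structure into each later keyframe's latent at every inversion timestep, which is how temporal keyframe information could limit the step-wise error accumulation the abstract describes.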
Pages: 38-55
Page Count: 18