FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Cited by: 0
Authors
Zhang, Youyuan [1 ]
Ju, Xuan [2 ]
Clark, James J. [1 ]
Affiliations
[1] McGill Univ, Montreal, PQ, Canada
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) | 2025
Keywords
DOI
10.1109/WACV61041.2025.00360
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability through attention control. This yields a clear speed advantage, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment. The source code is available at github.com/youyuan-zhang/FastVideoEdit.
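For context, the self-consistency property the abstract refers to can be written in the form standard in the consistency-model literature (a general formulation, not quoted from this paper; the notation is illustrative): a consistency function maps any point on a probability-flow ODE trajectory back to the trajectory's origin, so its outputs agree across noise levels.

% Self-consistency of a consistency function f_theta along one PF-ODE trajectory
% (standard formulation from the consistency-model literature; notation illustrative)
f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T],
\qquad f_\theta(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon .

Because such a function maps directly to the trajectory endpoint from any noise level, an edited sample can be produced in very few sampling steps and without DDIM inversion, which is the source of the speed advantage claimed in the abstract.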
Pages: 3657-3666
Page count: 10