Pix2Video: Video Editing using Image Diffusion

Cited by: 53

Authors:

Ceylan, Duygu [1]

Huang, Chun-Hao P. [1]

Mitra, Niloy J. [1,2]

Affiliations:

[1] Adobe Res, San Francisco, CA 94107 USA

[2] UCL, London, England

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023

Keywords:

DOI:

10.1109/ICCV51070.2023.02121

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator models in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on arXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
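
The "self-attention feature injection" step named in the abstract can be illustrated with a small sketch. The abstract does not say which frames supply the injected features or how they are combined, so the following is only an assumption-laden illustration: queries come from the current frame, while keys and values are computed from reference frames (here, hypothetically, an anchor frame and the previous frame). The function name `injected_self_attention`, the projection matrices, and the token shapes are all made up for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def injected_self_attention(cur, sources, w_q, w_k, w_v):
    """Attention where keys/values are injected from other frames.

    cur:     (n, d) token features of the frame being edited.
    sources: list of (m_i, d) token features from reference frames
             (assumed here: anchor frame and previous frame).
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical weights).
    """
    ctx = np.concatenate(sources, axis=0)   # stack reference tokens
    q = cur @ w_q                           # queries from current frame
    k = ctx @ w_k                           # keys from reference frames
    v = ctx @ w_v                           # values from reference frames
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                         # (n, d) output features
```

Passing `sources=[cur]` reduces this to ordinary self-attention, which makes the contrast explicit: the only change is where the keys and values are read from, so no weights need to be retrained.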

Citation

Pages: 23149-23160

Page count: 12
