Structure and Content-Guided Video Synthesis with Diffusion Models

Cited by: 151

Authors:

Esser, Patrick [1]

Chiu, Johnathan [1]

Atighehchian, Parmida [1]

Granskog, Jonathan [1]

Germanidis, Anastasis [1]

Affiliation:

[1] Runway, New York, NY 10013 USA

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023

DOI:

10.1109/ICCV51070.2023.00675

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text-guided generative diffusion models unlock powerful image creation and editing tools. Recent approaches that edit the content of footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. A novel guidance method, enabled by joint video and image training, exposes explicit control over temporal consistency. Our experiments demonstrate a wide variety of successes: fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

Pages: 7312-7322

Page count: 11
