Structure and Content-Guided Video Synthesis with Diffusion Models

Cited by: 151
Authors
Esser, Patrick [1]
Chiu, Johnathan [1]
Atighehchian, Parmida [1]
Granskog, Jonathan [1]
Germanidis, Anastasis [1]
Affiliations
[1] Runway, New York, NY 10013 USA
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023
DOI
10.1109/ICCV51070.2023.00675
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text-guided generative diffusion models unlock powerful image creation and editing tools. Recent approaches that edit the content of footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. A novel guidance method, enabled by joint video and image training, exposes explicit control over temporal consistency. Our experiments demonstrate a wide variety of successes: fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.
Pages: 7312-7322
Page count: 11