VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Cited by: 2

Authors:

Jeong, Hyeonho ^{[1]}

Park, Geon Yeong ^{[2]}

Ye, Jong Chul ^{[1,2]}

Affiliations:

[1] Korea Adv Inst Sci & Technol, Kim Jaechul Grad Sch AI, Seoul, South Korea

[2] Korea Adv Inst Sci & Technol, Bio & Brain Engn, Seoul, South Korea

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

10.1109/CVPR52733.2024.00880

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (a) accurately reproducing motion from a target video, and (b) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, here we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our code and data can be found at https://video-motion-customization.github.io/.


Pages: 9212-9221

Page count: 10
