Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Cited by: 3

Authors:

Ham, Cusuh [1]

Hays, James [1]

Lu, Jingwan [2]

Singh, Krishna Kumar [2]

Zhang, Zhifei [2]

Hinz, Tobias [2]

Affiliations:

[1] Georgia Institute of Technology, Atlanta, GA 30332, USA

[2] Adobe Research, San Francisco, CA, USA

Source:

Proceedings of SIGGRAPH 2023 Conference Papers, SIGGRAPH 2023 | 2023

Keywords:

image synthesis; image generation; multimodal synthesis; neural networks; diffusion models

DOI:

10.1145/3588432.3591549

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion network, consists of only approximately 1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.
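The general idea in the abstract (a frozen base predictor whose output is adjusted by a small trainable module driven by a 2D condition map) can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_base_predict`, `TinyModulator`, the affine scale/shift form of the modulation, and the synthetic target are all assumptions made purely for illustration. The only points it tries to show are that the base predictor is called as a black box, that only the small module's parameters are updated, and that no gradient is ever taken through the base predictor.

```python
def frozen_base_predict(x_t):
    """Hypothetical stand-in for a pretrained diffusion network's noise
    prediction. It is used as a black box and is never updated."""
    return [0.5 * v for v in x_t]


class TinyModulator:
    """Hypothetical small module with two trainable parameters. It applies an
    affine modulation to the base prediction wherever the condition map is on."""

    def __init__(self):
        self.scale = 0.0
        self.shift = 0.0

    def __call__(self, eps, cond):
        return [e * (1.0 + self.scale * c) + self.shift * c
                for e, c in zip(eps, cond)]

    def sgd_step(self, eps, cond, target, lr=0.5):
        """One gradient step on the mean squared error. Gradients are taken
        only with respect to scale and shift; eps is treated as a constant."""
        out = self(eps, cond)
        n = len(out)
        g_scale = sum(2.0 * (o - y) * e * c
                      for o, y, e, c in zip(out, target, eps, cond)) / n
        g_shift = sum(2.0 * (o - y) * c
                      for o, y, c in zip(out, target, cond)) / n
        self.scale -= lr * g_scale
        self.shift -= lr * g_shift
        # Return the loss measured before this update.
        return sum((o - y) ** 2 for o, y in zip(out, target)) / n


# Toy "image" of four pixels and a flattened binary condition map (assumed).
x_t = [1.0, -2.0, 0.5, 3.0]
cond = [1.0, 0.0, 1.0, 0.0]

# The base prediction is computed once and only ever read.
eps = frozen_base_predict(x_t)

# Synthetic training target that differs from the base prediction only
# where the condition map is on (an assumption for this demo).
target = [e * (1.0 + 0.8 * c) + 0.3 * c for e, c in zip(eps, cond)]

module = TinyModulator()
losses = [module.sgd_step(eps, cond, target) for _ in range(200)]
modulated = module(eps, cond)
```

In this toy setup the training loss falls as the two module parameters are fitted, while pixels where the condition map is zero pass through with the base prediction unchanged.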

Pages: 11
