Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Cited by: 3
Authors
Ham, Cusuh [1]
Hays, James [1]
Lu, Jingwan [2]
Singh, Krishna Kumar [2]
Zhang, Zhifei [2]
Hinz, Tobias [2]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Adobe Res, San Francisco, CA USA
Keywords
image synthesis; image generation; multimodal synthesis; neural networks; diffusion models
DOI
10.1145/3588432.3591549
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap: it does not require gradients from the original diffusion network, it has only ~1% of the number of parameters of the base diffusion model, and it is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with the conditioning inputs.
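The idea of modulating a frozen network's prediction during sampling can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the modulation, so the multiplicative rule, the names `frozen_denoiser`, `ModulationModule`, and `sample_step`, the `gain` parameter, and the update rule are all hypothetical stand-ins chosen for illustration. Images are flattened to plain lists of floats.

```python
def frozen_denoiser(x_t, t):
    """Toy stand-in for a pretrained diffusion network (assumption).

    Its behavior is fixed: nothing in this sketch ever updates it.
    """
    return [0.1 * v * (t + 1) for v in x_t]


class ModulationModule:
    """Hypothetical small module that adjusts the frozen prediction
    per pixel using a 2D condition map (e.g. a flattened sketch mask)."""

    def __init__(self, gain=0.5):
        # The only adjustable parameter in this toy module (assumption).
        self.gain = gain

    def __call__(self, eps, cond):
        # Illustrative multiplicative modulation: pixels where the
        # condition map is zero keep the frozen prediction unchanged.
        return [e * (1.0 + self.gain * c) for e, c in zip(eps, cond)]


def sample_step(x_t, t, cond, module, step_size=0.5):
    """One simplified sampling step using the modulated prediction."""
    eps = frozen_denoiser(x_t, t)   # frozen network, no parameter updates
    eps_mod = module(eps, cond)     # small module modulates the prediction
    return [x - step_size * e for x, e in zip(x_t, eps_mod)]


if __name__ == "__main__":
    module = ModulationModule(gain=0.5)
    x_t = [1.0, 2.0, -1.0, 0.5]
    cond = [1.0, 0.0, 1.0, 0.0]     # condition active on pixels 0 and 2
    print(sample_step(x_t, 3, cond, module))
```

In this sketch only `ModulationModule` holds anything that could be trained, which mirrors the abstract's claim that the base diffusion network's parameters are left untouched; how the real module is parameterized and trained is not stated in this record.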
Pages: 11
Related papers
50 in total
  • [1] Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model
    Yang, Shiyuan
    Chen, Xiaodong
    Liao, Jing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3190 - 3199
  • [2] Prompt Tuning for Unified Multimodal Pretrained Models
    Yang, Hao
    Lin, Junyang
    Yang, An
    Wang, Peng
    Zhou, Chang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 402 - 416
  • [3] Visual Commonsense in Pretrained Unimodal and Multimodal Models
    Zhang, Chenyu
    Van Durme, Benjamin
    Li, Zhuowan
    Stengel-Eskin, Elias
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5321 - 5335
  • [4] Point-Cloud Completion with Pretrained Text-to-image Diffusion Models
    Kasten, Yoni
    Rahamim, Ohad
    Chechik, Gal
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] Transferring General Multimodal Pretrained Models to Text Recognition
    Lin, Junyang
    Ren, Xuancheng
    Zhang, Yichang
    Liu, Gao
    Wang, Peng
    Yang, An
    Zhou, Chang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 588 - 597
  • [6] medigan: a Python library of pretrained generative models for medical image synthesis
    Osuala, Richard
    Skorupko, Grzegorz
    Lazrak, Noussair
    Garrucho, Lidia
    Garcia, Eloy
    Joshi, Smriti
    Jouide, Socayna
    Rutherford, Michael
    Prior, Fred
    Kushibar, Kaisar
    Diaz, Oliver
    Lekadir, Karim
    JOURNAL OF MEDICAL IMAGING, 2023, 10 (06)
  • [7] SAR Image Synthesis with Diffusion Models
    Qosja, Denisa
    Wagner, Simon
    O'Hagan, Daniel
    2024 IEEE RADAR CONFERENCE, RADARCONF 2024, 2024,
  • [8] Multimodal Data Augmentation for Image Captioning using Diffusion Models
    Xiao, Changrong
    Xu, Sean Xin
    Zhang, Kunpeng
    PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 23 - 33
  • [9] A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis
    Mueller-Franzes, Gustav
    Niehues, Jan Moritz
    Khader, Firas
    Arasteh, Soroosh Tayebi
    Haarburger, Christoph
    Kuhl, Christiane
    Wang, Tianci
    Han, Tianyu
    Nolte, Teresa
    Nebelung, Sven
    Kather, Jakob Nikolas
    Truhn, Daniel
    SCIENTIFIC REPORTS, 2023, 13 (01)