Multi-Region Text-Driven Manipulation of Diffusion Imagery

Cited: 0
Authors
Li, Yiming [1 ,2 ]
Zhou, Peng [3 ]
Sun, Jun [1 ]
Xu, Yi [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Key Lab Digital Media Proc & Transmiss, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE, Key Lab Artificial Intelligence, Shanghai, Peoples R China
[3] China Mobile Suzhou Software Technol Co Ltd, Suzhou, Peoples R China
Source
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 4, 2024
DOI: not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on attribute editing for individual objects but encounter challenges in multi-object editing, mainly because of the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework that enables manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we aim to automatically generate a rational multi-object spatial distribution in which disparate regions are fused into a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when the method is applied to Stable Diffusion, lengthy quality-related yet object-agnostic words hamper the manipulation. To keep the focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. We further demonstrate a downstream application of our method, multi-region inversion, which is tailored for manipulating multiple objects in real images. Our approach is compatible with variants of Stable Diffusion models and readily applicable to manipulating diverse objects in large images with high-quality generation, showing strong image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.
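The MultiDiffusion baseline mentioned in the abstract fuses per-region denoiser outputs into one coherent image at each sampling step. The toy sketch below illustrates only that fusion idea (mask-weighted averaging of overlapping region predictions), not the authors' implementation; the function name `fuse_region_predictions` and the use of plain NumPy arrays in place of latent tensors are illustrative assumptions.

```python
import numpy as np

def fuse_region_predictions(preds, masks, eps=1e-8):
    """MultiDiffusion-style fusion (illustrative sketch).

    preds: list of per-region denoiser outputs, each an array of the
           same shape as the full canvas.
    masks: list of binary region masks, same shape as the canvas.

    Each pixel of the fused output is the average of the predictions
    from all regions whose mask covers that pixel.
    """
    numerator = np.zeros_like(preds[0], dtype=float)
    coverage = np.zeros_like(masks[0], dtype=float)
    for pred, mask in zip(preds, masks):
        numerator += pred * mask   # accumulate masked predictions
        coverage += mask           # count how many regions cover each pixel
    return numerator / np.maximum(coverage, eps)

# Two overlapping regions on a 4x4 canvas: the left region predicts 2.0
# everywhere, the right region predicts 4.0; their overlap averages to 3.0.
preds = [np.full((4, 4), 2.0), np.full((4, 4), 4.0)]
mask_left = np.zeros((4, 4)); mask_left[:, :3] = 1.0
mask_right = np.zeros((4, 4)); mask_right[:, 1:] = 1.0
fused = fuse_region_predictions(preds, [mask_left, mask_right])
```

In the actual sampling loop this fusion would run once per diffusion step on latent tensors, which is what lets disparate region prompts behave as a unified entity.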
Pages: 3261-3269
Page count: 9