Multi-Region Text-Driven Manipulation of Diffusion Imagery

Cited: 0
Authors
Li, Yiming [1 ,2 ]
Zhou, Peng [3 ]
Sun, Jun [1 ]
Xu, Yi [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Key Lab Digital Media Proc & Transmiss, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE, Key Lab Artificial Intelligence, Shanghai, Peoples R China
[3] China Mobile Suzhou Software Technol Co Ltd, Suzhou, Peoples R China
Source
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 4, 2024
DOI: not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on attribute editing for individual objects but encounter challenges in multi-object editing, mainly because of the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework that enables manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we aim to automatically generate a rational multi-object spatial distribution in which disparate regions are fused into a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when the method is applied to Stable Diffusion, lengthy quality-related yet object-agnostic words hamper the manipulation. To keep the focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. We further demonstrate a downstream application of our method, multi-region inversion, which is tailored for manipulating multiple objects in real images. Our approach is compatible with variants of Stable Diffusion models and readily applicable to manipulating diverse objects in large images with high-quality generation, showing strong image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.
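The MultiDiffusion baseline mentioned in the abstract fuses per-region denoiser outputs into one coherent image at each sampling step. The toy sketch below illustrates only that fusion idea (mask-weighted averaging of overlapping region predictions), not the authors' implementation; the function name `fuse_region_predictions` and the use of plain NumPy arrays in place of latent tensors are illustrative assumptions.

```python
import numpy as np

def fuse_region_predictions(preds, masks, eps=1e-8):
    """MultiDiffusion-style fusion (illustrative sketch).

    preds: list of per-region denoiser outputs, each an array of the
           same shape as the full canvas.
    masks: list of binary region masks, same shape as the canvas.

    Each pixel of the fused output is the average of the predictions
    from all regions whose mask covers that pixel.
    """
    numerator = np.zeros_like(preds[0], dtype=float)
    coverage = np.zeros_like(masks[0], dtype=float)
    for pred, mask in zip(preds, masks):
        numerator += pred * mask   # accumulate masked predictions
        coverage += mask           # count how many regions cover each pixel
    return numerator / np.maximum(coverage, eps)

# Two overlapping regions on a 4x4 canvas: the left region predicts 2.0
# everywhere, the right region predicts 4.0; their overlap averages to 3.0.
preds = [np.full((4, 4), 2.0), np.full((4, 4), 4.0)]
mask_left = np.zeros((4, 4)); mask_left[:, :3] = 1.0
mask_right = np.zeros((4, 4)); mask_right[:, 1:] = 1.0
fused = fuse_region_predictions(preds, [mask_left, mask_right])
```

In the actual sampling loop this fusion would run once per diffusion step on latent tensors, which is what lets disparate region prompts behave as a unified entity.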
Pages: 3261-3269
Page count: 9