SpaText: Spatio-Textual Representation for Controllable Image Generation

Cited by: 70

Authors:

Avrahami, Omri [1,2]

Hayes, Thomas [1]

Gafni, Oran [1]

Gupta, Sonal [1]

Taigman, Yaniv [1]

Parikh, Devi [1]

Lischinski, Dani [2]

Fried, Ohad [3]

Yin, Xi [1]

Affiliations:

[1] Meta AI, London, England

[2] Hebrew Univ Jerusalem, Jerusalem, Israel

[3] Reichman Univ, Herzliyya, Israel

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.01762

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText - a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.


Pages: 18370-18380

Page count: 11
