SpaText: Spatio-Textual Representation for Controllable Image Generation

Cited by: 46
Authors
Avrahami, Omri [1, 2]
Hayes, Thomas [1]
Gafni, Oran [1]
Gupta, Sonal [1]
Taigman, Yaniv [1]
Parikh, Devi [1]
Lischinski, Dani [2]
Fried, Ohad [3]
Yin, Xi [1]
Affiliations
[1] Meta AI, London, England
[2] Hebrew Univ Jerusalem, Jerusalem, Israel
[3] Reichman Univ, Herzliyya, Israel
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01762
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText, a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
Pages: 18370-18380
Page count: 11