SpaText: Spatio-Textual Representation for Controllable Image Generation

Cited by: 46
Authors
Avrahami, Omri [1 ,2 ]
Hayes, Thomas [1 ]
Gafni, Oran [1 ]
Gupta, Sonal [1 ]
Taigman, Yaniv [1 ]
Parikh, Devi [1 ]
Lischinski, Dani [2 ]
Fried, Ohad [3 ]
Yin, Xi [1 ]
Affiliations
[1] Meta AI, London, England
[2] Hebrew Univ Jerusalem, Jerusalem, Israel
[3] Reichman Univ, Herzliya, Israel
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01762
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText, a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map in which each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets with a detailed textual description for each region of an image, we leverage existing large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, showing its effectiveness on two state-of-the-art diffusion models: one pixel-based and one latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method, showing that it achieves state-of-the-art results on image generation with free-form textual scene control.
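The core idea in the abstract, a CLIP-based spatio-textual representation in which each annotated region carries the embedding of its free-form description, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`spatio_textual_map`, `embed_fn`) and the dense (H, W, d) tensor layout are assumptions for the sake of the example.

```python
import numpy as np

def spatio_textual_map(seg_map, region_texts, embed_fn, d):
    """Build a dense spatio-textual map.

    seg_map      : (H, W) integer array of region ids from the user's segmentation map
    region_texts : dict mapping region id -> free-form natural language description
    embed_fn     : text -> (d,) embedding (e.g. a CLIP text encoder); assumed here
    d            : embedding dimension

    Every pixel of an annotated region receives the embedding of that
    region's description; unannotated pixels stay zero.
    """
    h, w = seg_map.shape
    out = np.zeros((h, w, d), dtype=np.float32)
    for region_id, text in region_texts.items():
        emb = embed_fn(text)            # one embedding per region description
        out[seg_map == region_id] = emb  # broadcast over that region's pixels
    return out
```

The resulting (H, W, d) tensor can then be concatenated with the diffusion model's input channels, which is one plausible way such a representation could condition generation.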
Pages: 18370-18380
Page count: 11