GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Cited by: 78
Authors
Tao, Ming [1 ,2 ]
Bao, Bing-Kun [1 ,2 ]
Tang, Hao [3 ]
Xu, Changsheng [2 ,4 ,5 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Nanjing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Swiss Fed Inst Technol, CVL, Zurich, Switzerland
[4] Chinese Acad Sci CASIA, Inst Automat, MAIS, Beijing, Peoples R China
[5] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01366
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Synthesizing high-fidelity complex images from text is challenging. Building on large-scale pretraining, autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, three flaws remain. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows image synthesis considerably. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to assess image quality accurately. Furthermore, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters, achieving results comparable to large pretrained autoregressive and diffusion models. Moreover, our model achieves roughly 120× faster synthesis speed and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at https://github.com/tobran/GALIP.
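The abstract's central idea, reusing a frozen pretrained CLIP image encoder inside the discriminator so that only a small head is trained adversarially, can be illustrated with a minimal NumPy sketch. Everything below is a hypothetical illustration, not the authors' implementation: `frozen_clip_encode` is a fixed random projection standing in for real CLIP-ViT features, the concatenation-based text conditioning and the hinge loss are common GAN choices assumed here, and all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, CLIP_DIM = 64 * 64 * 3, 512

# Fixed (never-updated) random projection standing in for the frozen,
# pretrained CLIP image encoder used by the GALIP discriminator.
W_CLIP = rng.standard_normal((IMG_DIM, CLIP_DIM)) / np.sqrt(IMG_DIM)

def frozen_clip_encode(images):
    """Hypothetical stand-in for frozen CLIP image features."""
    return images.reshape(len(images), -1) @ W_CLIP

# Small learnable head on top of the frozen features: it scores an image,
# conditioned on the CLIP text embedding of the caption.
W_HEAD = rng.standard_normal((2 * CLIP_DIM, 1)) * 0.01

def discriminator(images, text_emb):
    feats = frozen_clip_encode(images)                      # frozen pathway
    text = np.broadcast_to(text_emb, feats.shape)           # tile per image
    return np.concatenate([feats, text], axis=1) @ W_HEAD   # learnable head

def hinge_d_loss(real_scores, fake_scores):
    # Standard GAN hinge objective for the discriminator (an assumed,
    # commonly used choice; the paper's exact objective may differ).
    return (np.maximum(0.0, 1.0 - real_scores).mean()
            + np.maximum(0.0, 1.0 + fake_scores).mean())

# Toy usage: random "real" and "generated" image batches plus a random
# stand-in text embedding.
real = rng.standard_normal((4, 64, 64, 3))
fake = rng.standard_normal((4, 64, 64, 3))
text_emb = rng.standard_normal(CLIP_DIM)
loss = hinge_d_loss(discriminator(real, text_emb),
                    discriminator(fake, text_emb))
print(float(loss))
```

Because the large encoder is frozen, only `W_HEAD` would receive gradients during training, which is one way to read the abstract's claim of needing only about 6% learnable parameters.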
Pages: 14214-14223 (10 pages)