Text to Image Generation with Conformer-GAN

Cited by: 0

Authors:

Deng, Zhiyu [1]

Yu, Wenxin [1]

Che, Lu [1]

Chen, Shiyu [1]

Zhang, Zhiqiang [1]

Shang, Jun [1]

Chen, Peng [2]

Gong, Jun [3]

Affiliations:

[1] Southwest Univ Sci & Technol, Mianyang, Sichuan, Peoples R China

[2] Chengdu Hongchengyun Technol Co Ltd, Chengdu, Peoples R China

[3] Southwest Automat Res Inst, Mianyang, Sichuan, Peoples R China

Source:

NEURAL INFORMATION PROCESSING, ICONIP 2023, PT V | 2024 / Vol. 14451

Keywords:

Text-to-Image Synthesis; Computer Vision; Deep Learning; Generative Adversarial Networks

DOI:

10.1007/978-981-99-8073-4_1

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text-to-image generation (T2I) has been a popular research field in recent years, and its goal is to generate corresponding photorealistic images through natural language text descriptions. Existing T2I models are mostly based on generative adversarial networks, but it is still very challenging to guarantee the semantic consistency between a given textual description and generated natural images. To address this problem, we propose a concise and practical novel framework, Conformer-GAN. Specifically, we propose the Conformer block, consisting of the Convolutional Neural Network (CNN) and Transformer branches. The CNN branch is used to generate images conditionally from noise. The Transformer branch continuously focuses on the relevant words in natural language descriptions and fuses the sentence and word information to guide the CNN branch for image generation. Our approach can better merge global and local representations to improve the semantic consistency between textual information and synthetic images. Importantly, our Conformer-GAN can generate natural and realistic 512 x 512 images. Extensive experiments on the challenging public benchmark datasets CUB bird and COCO demonstrate that our method outperforms recent state-of-the-art methods both in terms of generated image quality and text-image semantic consistency.


Pages: 3-14

Page count: 12
