SWF-GAN: A Text-to-Image model based on sentence-word fusion perception

Cited by: 4
Authors
Liu, Chun [1 ]
Hu, Jingsong [1 ]
Lin, Hong [1 ]
Affiliations
[1] Wuhan Univ Technol, Wuhan 430070, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Text-to-Image; Image generation; Generative adversarial network
DOI
10.1016/j.cag.2023.07.038
Chinese Library Classification
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
Synthesizing images from descriptive text is an exciting and challenging task in multimodal deep learning, with broad applications in visual reasoning, image editing, style transfer, and related fields. This paper proposes SWF-GAN to address two problems: the limited constraint imposed by coarse-grained information makes it difficult to build accurate text-to-image semantic mappings, and ordinary mask predictors lack the representational capacity to perceive the global information of images accurately. SWF-GAN introduces a sentence-word fusion perceptual module that divides the semantic perception of the generative model into two layers, sentence and word: affine transformations conditioned on coarse-grained sentence-level features constrain the overall image synthesis, while fine-grained word-level features guide the synthesis of specific image details. Additionally, a weakly supervised coordinate mask predictor in the sentence layer extracts long-range dependencies with precise positional information along the vertical and horizontal directions, assigning more attention to the subject against a complex image background and thereby generating the structure of the target object accurately. Experiments show that the proposed sentence-word fusion perceptual generative adversarial network generates clearer and more lifelike images without a heavy computational burden. Compared with the baseline model, it improves the IS and FID scores by 0.97% and 22.95%, respectively; results on different datasets and an ablation study confirm the model's effectiveness. © 2023 Elsevier Ltd. All rights reserved.
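The abstract does not give the paper's exact layer definitions, but the two core ideas it names (sentence-conditioned affine modulation of image features, and a coordinate mask built from directional pooling along height and width) can be illustrated with a minimal NumPy sketch. All shapes, weight matrices, and the mean-based channel fusion below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sentence_affine(feat, sent_emb, W_gamma, W_beta):
    """Sentence-level affine modulation (illustrative): predict a
    per-channel scale and shift from the sentence embedding and
    apply them to the image feature map (C, H, W)."""
    gamma = sent_emb @ W_gamma                        # (C,) scale
    beta = sent_emb @ W_beta                          # (C,) shift
    return feat * (1.0 + gamma[:, None, None]) + beta[:, None, None]

def coordinate_mask(feat):
    """Coordinate-mask sketch: pool along width and height separately
    (strip pooling) to capture long-range dependencies in each
    direction with positional information preserved, then fuse the
    two directional descriptors into a spatial mask in (0, 1)."""
    h_pool = feat.mean(axis=2, keepdims=True)         # (C, H, 1)
    w_pool = feat.mean(axis=1, keepdims=True)         # (C, 1, W)
    logits = (h_pool + w_pool).mean(axis=0)           # (H, W), channels fused
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid -> soft mask

# Toy sizes: 8 channels, 4x4 feature map, 16-dim sentence embedding.
C, H, W, D = 8, 4, 4, 16
feat = rng.standard_normal((C, H, W))
sent = rng.standard_normal(D)
Wg = rng.standard_normal((D, C)) * 0.1
Wb = rng.standard_normal((D, C)) * 0.1

mod = sentence_affine(feat, sent, Wg, Wb)             # sentence-constrained features
mask = coordinate_mask(mod)                           # (H, W) foreground mask
print(mod.shape, mask.shape)
```

In the model described by the abstract, a mask of this kind would weight the subject region against a complex background before word-level refinement; here the mask is simply returned so its shape and range can be inspected.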
Pages: 500-510
Page count: 11