SWF-GAN: A Text-to-Image model based on sentence-word fusion perception

Cited by: 4
Authors
Liu, Chun [1 ]
Hu, Jingsong [1 ]
Lin, Hong [1 ]
Affiliations
[1] Wuhan Univ Technol, Wuhan 430070, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Text-to-Image; Image generation; Generative adversarial network
DOI
10.1016/j.cag.2023.07.038
Chinese Library Classification
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
Synthesizing images from descriptive text is an exciting and challenging task in multimodal deep learning, with broad applications in visual reasoning, image editing, style transfer, and related fields. This paper proposes SWF-GAN to address two problems: the limited constraint imposed by coarse-grained information makes it difficult to build accurate text-to-image semantic mappings, and ordinary mask predictors lack the representational capacity to perceive the global information of images accurately. SWF-GAN introduces a sentence-word fusion perceptual module that divides the semantic perception of the generative model into two layers, sentence and word: affine transformations conditioned on coarse-grained sentence-level features constrain the overall image synthesis, while fine-grained word-level features guide the synthesis of specific image details. Additionally, a weakly supervised coordinate mask predictor in the sentence layer extracts long-range dependencies with precise positional information along the vertical and horizontal directions, assigning more attention to the subject against a complex image background and thereby generating the structure of the target object accurately. Experiments show that the proposed sentence-word fusion perceptual generative adversarial network generates clearer and more lifelike images without a heavy computational burden. Compared with the baseline model, it improves the IS and FID scores by 0.97% and 22.95%, respectively; results on different datasets and an ablation study confirm the model's effectiveness. © 2023 Elsevier Ltd. All rights reserved.
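The abstract does not give the paper's exact layer definitions, but the two core ideas it names (sentence-conditioned affine modulation of image features, and a coordinate mask built from directional pooling along height and width) can be illustrated with a minimal NumPy sketch. All shapes, weight matrices, and the mean-based channel fusion below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sentence_affine(feat, sent_emb, W_gamma, W_beta):
    """Sentence-level affine modulation (illustrative): predict a
    per-channel scale and shift from the sentence embedding and
    apply them to the image feature map (C, H, W)."""
    gamma = sent_emb @ W_gamma                        # (C,) scale
    beta = sent_emb @ W_beta                          # (C,) shift
    return feat * (1.0 + gamma[:, None, None]) + beta[:, None, None]

def coordinate_mask(feat):
    """Coordinate-mask sketch: pool along width and height separately
    (strip pooling) to capture long-range dependencies in each
    direction with positional information preserved, then fuse the
    two directional descriptors into a spatial mask in (0, 1)."""
    h_pool = feat.mean(axis=2, keepdims=True)         # (C, H, 1)
    w_pool = feat.mean(axis=1, keepdims=True)         # (C, 1, W)
    logits = (h_pool + w_pool).mean(axis=0)           # (H, W), channels fused
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid -> soft mask

# Toy sizes: 8 channels, 4x4 feature map, 16-dim sentence embedding.
C, H, W, D = 8, 4, 4, 16
feat = rng.standard_normal((C, H, W))
sent = rng.standard_normal(D)
Wg = rng.standard_normal((D, C)) * 0.1
Wb = rng.standard_normal((D, C)) * 0.1

mod = sentence_affine(feat, sent, Wg, Wb)             # sentence-constrained features
mask = coordinate_mask(mod)                           # (H, W) foreground mask
print(mod.shape, mask.shape)
```

In the model described by the abstract, a mask of this kind would weight the subject region against a complex background before word-level refinement; here the mask is simply returned so its shape and range can be inspected.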
Pages: 500-510
Page count: 11