SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

Cited by: 42
Authors
Peng, Dunlu [1 ]
Yang, Wuchen [1 ]
Liu, Cong [1 ]
Lu, Shuairui [1 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Sch Opt Elect & Comp Engn, Shanghai Key Lab Modern Opt Syst, Shanghai 200093, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-image synthesis; SAM-GAN; Self-attention mechanism; Machine learning;
DOI
10.1016/j.neunet.2021.01.023
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Synthesizing photo-realistic images from text descriptions is a challenging task in computer vision. Although generative adversarial networks have made significant breakthroughs in this task, they still face huge challenges in generating high-quality, visually realistic images consistent with the semantics of the text. Generally, existing text-to-image methods accomplish this task in two steps: first generating an initial image with a rough outline and color, and then gradually refining the initial image into a high-resolution one. However, one drawback of these methods is that, if the quality of the initial image is not high, it is hard to generate a satisfactory high-resolution image. In this paper, we propose SAM-GAN, Self-Attention supporting Multi-stage Generative Adversarial Networks, for text-to-image synthesis. With the self-attention mechanism, the model can establish multi-level dependencies within the image and fuse the sentence- and word-level visual-semantic vectors to improve the quality of the generated image. Furthermore, a multi-stage perceptual loss is introduced to enhance the semantic similarity between the synthesized image and the real image, thus strengthening the visual-semantic consistency between text and images. To improve the diversity of the generated images, a mode seeking regularization term is integrated into the model. The results of extensive experiments and ablation studies, conducted on the Caltech-UCSD Birds and Microsoft Common Objects in Context datasets, show that our model outperforms competitive models in text-to-image synthesis. (c) 2021 Elsevier Ltd. All rights reserved.
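The mode seeking regularization term mentioned in the abstract follows the usual formulation (Mao et al., CVPR 2019): two latent codes that are far apart should produce images that are far apart, which penalizes mode collapse. A minimal NumPy sketch, assuming L1 distances and the standard ratio form; the paper's exact distance metrics and loss weighting are not stated in this abstract:

```python
import numpy as np

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Mode seeking regularization (ratio form, an assumption based on
    Mao et al. 2019, not the paper's exact implementation).

    The generator minimizes this loss, which is equivalent to maximizing
    the image distance produced per unit of latent distance: identical
    outputs for distinct codes (mode collapse) yield a very large loss.
    """
    d_img = np.mean(np.abs(img1 - img2))  # L1 distance between generated images
    d_z = np.mean(np.abs(z1 - z2))        # L1 distance between latent codes
    return d_z / (d_img + eps)            # small when outputs are diverse
```

In training, this term would be added to the generator objective with a weighting coefficient, alongside the adversarial and perceptual losses the abstract describes.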
Pages: 57-67
Page count: 11