MF-GAN: Multi-conditional Fusion Generative Adversarial Network for Text-to-Image Synthesis

Cited by: 4
Authors
Yang, Yuyan [1 ,2 ]
Ni, Xin [1 ,2 ]
Hao, Yanbin [1 ,2 ]
Liu, Chenyu [3 ]
Wang, Wenshan [3 ]
Liu, Yifeng [3 ]
Xie, Haiyong [2 ,4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Minist Culture & Tourism, Key Lab Cyberculture Content Cognit & Detect, Hefei 230026, Anhui, Peoples R China
[3] Natl Engn Lab Risk Percept & Prevent NEL RPP, Beijing 100041, Peoples R China
[4] Capital Med Univ, Adv Innovat Ctr Human Brain Protect, Beijing 100069, Peoples R China
Source
MULTIMEDIA MODELING (MMM 2022), PT I | 2022 / Vol. 13141
Keywords
Text-to-Image; GAN; Triplet loss
DOI
10.1007/978-3-030-98358-1_4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The performance of text-to-image synthesis has improved significantly with the development of generative adversarial network (GAN) techniques. Current GAN-based methods for text-to-image generation mainly adopt multiple generator-discriminator pairs to exploit coarse- and fine-grained textual content (e.g., sentences and words); however, they consider only the semantic consistency between the text-image pair. One drawback of such a multi-stream structure is that it yields heavyweight models, while the single-stream counterpart suffers from insufficient use of the text. To alleviate these problems, we propose a Multi-conditional Fusion GAN (MF-GAN) that reaps the benefits of both the multi-stream and the single-stream designs. MF-GAN is a single-stream model, yet it exploits both coarse- and fine-grained textual information through a conditional residual block and a dual attention block: the sentence and word features are repeatedly fed into different stages of the model to enrich the textual conditioning. Furthermore, we introduce a triplet loss that closes the visual gap between the synthesized image and its positive (matching) real image and enlarges the gap to its negative (mismatched) image. To thoroughly verify our method, we conduct extensive experiments on the benchmark CUB and COCO datasets. Experimental results show that the proposed MF-GAN outperforms state-of-the-art methods.
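The triplet loss mentioned in the abstract can be illustrated with a minimal sketch, not the authors' implementation: the synthesized image acts as the anchor, its matching real image as the positive, and a mismatched real image as the negative, all compared in an embedding space. The image encoder, the cosine distance, and the margin value below are assumptions for illustration only; the paper may use a different distance or margin.

```python
# Minimal sketch (assumption, not the authors' code) of a hinge-style triplet loss
# that pulls the synthesized image toward its positive image and pushes it away
# from its negative image in an embedding space.
import torch
import torch.nn.functional as F

def triplet_loss(img_encoder, fake_img, pos_img, neg_img, margin=0.2):
    """Triplet margin loss over image embeddings (img_encoder is hypothetical)."""
    anchor = F.normalize(img_encoder(fake_img), dim=-1)    # synthesized image
    positive = F.normalize(img_encoder(pos_img), dim=-1)   # matching real image
    negative = F.normalize(img_encoder(neg_img), dim=-1)   # mismatched real image
    d_pos = 1.0 - (anchor * positive).sum(dim=-1)           # cosine distance to positive
    d_neg = 1.0 - (anchor * negative).sum(dim=-1)           # cosine distance to negative
    return F.relu(d_pos - d_neg + margin).mean()            # penalize when d_pos + margin > d_neg
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, which matches the abstract's goal of shrinking the gap to the positive image while enlarging it to the negative one.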
Pages: 41-53
Number of pages: 13