MF-GAN: Multi-conditional Fusion Generative Adversarial Network for Text-to-Image Synthesis

Cited by: 4
Authors
Yang, Yuyan [1 ,2 ]
Ni, Xin [1 ,2 ]
Hao, Yanbin [1 ,2 ]
Liu, Chenyu [3 ]
Wang, Wenshan [3 ]
Liu, Yifeng [3 ]
Xie, Haiyong [2 ,4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Minist Culture & Tourism, Key Lab Cyberculture Content Cognit & Detect, Hefei 230026, Anhui, Peoples R China
[3] Natl Engn Lab Risk Percept & Prevent NEL RPP, Beijing 100041, Peoples R China
[4] Capital Med Univ, Adv Innovat Ctr Human Brain Protect, Beijing 100069, Peoples R China
Source
MULTIMEDIA MODELING (MMM 2022), PT I | 2022 / Vol. 13141
Keywords
Text-to-Image; GAN; Triplet loss
DOI
10.1007/978-3-030-98358-1_4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The performance of text-to-image synthesis has improved significantly with the development of generative adversarial network (GAN) techniques. Current GAN-based methods for text-to-image generation mainly adopt multiple generator-discriminator pairs to exploit coarse- and fine-grained textual content (e.g., sentences and words); however, they consider only the semantic consistency between the text-image pair. One drawback of such a multi-stream structure is that it yields heavyweight models, while the single-stream counterpart suffers from insufficient use of the text. To alleviate these problems, we propose a Multi-conditional Fusion GAN (MF-GAN) that reaps the benefits of both the multi-stream and the single-stream designs. MF-GAN is a single-stream model, yet it exploits both coarse- and fine-grained textual information through a conditional residual block and a dual attention block: the sentence and word features are repeatedly fed into different stages of the model to enrich the textual conditioning. Furthermore, we introduce a triplet loss that closes the visual gap between the synthesized image and its positive (matching) real image and enlarges the gap to its negative (mismatched) image. To thoroughly verify our method, we conduct extensive experiments on the benchmark CUB and COCO datasets. Experimental results show that the proposed MF-GAN outperforms state-of-the-art methods.
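The triplet loss mentioned in the abstract can be illustrated with a minimal sketch, not the authors' implementation: the synthesized image acts as the anchor, its matching real image as the positive, and a mismatched real image as the negative, all compared in an embedding space. The image encoder, the cosine distance, and the margin value below are assumptions for illustration only; the paper may use a different distance or margin.

```python
# Minimal sketch (assumption, not the authors' code) of a hinge-style triplet loss
# that pulls the synthesized image toward its positive image and pushes it away
# from its negative image in an embedding space.
import torch
import torch.nn.functional as F

def triplet_loss(img_encoder, fake_img, pos_img, neg_img, margin=0.2):
    """Triplet margin loss over image embeddings (img_encoder is hypothetical)."""
    anchor = F.normalize(img_encoder(fake_img), dim=-1)    # synthesized image
    positive = F.normalize(img_encoder(pos_img), dim=-1)   # matching real image
    negative = F.normalize(img_encoder(neg_img), dim=-1)   # mismatched real image
    d_pos = 1.0 - (anchor * positive).sum(dim=-1)           # cosine distance to positive
    d_neg = 1.0 - (anchor * negative).sum(dim=-1)           # cosine distance to negative
    return F.relu(d_pos - d_neg + margin).mean()            # penalize when d_pos + margin > d_neg
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, which matches the abstract's goal of shrinking the gap to the positive image while enlarging it to the negative one.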
Pages: 41-53
Number of pages: 13