Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Times Cited: 6
Authors
Cheng, Qingrong [1 ]
Wen, Keyu [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Image synthesis; Visualization; Task analysis; Measurement; Generative adversarial networks; Image quality; text-to-image synthesis; vision-language matching; ATTENTION; GAN;
DOI
10.1109/TMM.2022.3217384
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Text-to-image synthesis is an attractive but challenging task that aims to generate a photo-realistic and semantically consistent image from a given text description. Images synthesized by off-the-shelf models usually contain fewer components than the corresponding real image and text description, which degrades both image quality and textual-visual consistency. To address this issue, we propose a novel vision-language matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The dual mechanism considers both textual-visual matching between the generated image and the corresponding text description and a visual-visual consistency constraint between the synthesized image and the real image. Given a text description, VLMGAN* first encodes it into textual features and then feeds them to a dual vision-language matching-based generative model to synthesize a photo-realistic and semantically consistent image. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly assess the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS), which evaluates both the image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement it on two popular baselines, denoted VLMGAN+AttnGAN and VLMGAN+DFGAN. Experimental results on two widely used datasets show that both models achieve significant improvements over other state-of-the-art methods.
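To make the dual matching mechanism concrete, below is a minimal PyTorch sketch of how such a training objective and the VLMS metric could look. The abstract does not give the paper's exact formulation, so everything here is an assumption for illustration: the function names dual_vision_language_matching_loss and vlms, the weights lambda_tv and lambda_vv, and the cosine-similarity form of the matching terms are all hypothetical, not VLMGAN*'s actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only. Assumes a text encoder and an image encoder
# that map inputs into a shared D-dimensional embedding space; the
# paper's actual loss and metric definitions may differ.

def dual_vision_language_matching_loss(
    text_emb: torch.Tensor,      # (B, D) features of the text description
    fake_img_emb: torch.Tensor,  # (B, D) features of the generated image
    real_img_emb: torch.Tensor,  # (B, D) features of the ground-truth image
    lambda_tv: float = 1.0,      # weight of the textual-visual term (assumed)
    lambda_vv: float = 1.0,      # weight of the visual-visual term (assumed)
) -> torch.Tensor:
    """Dual matching: pull the generated image toward both the text
    description (textual-visual) and the real image (visual-visual)."""
    text_emb = F.normalize(text_emb, dim=-1)
    fake_img_emb = F.normalize(fake_img_emb, dim=-1)
    real_img_emb = F.normalize(real_img_emb, dim=-1)

    # Textual-visual matching: generated image vs. its description.
    loss_tv = 1.0 - F.cosine_similarity(fake_img_emb, text_emb, dim=-1).mean()
    # Visual-visual matching: generated image vs. the real image.
    loss_vv = 1.0 - F.cosine_similarity(fake_img_emb, real_img_emb, dim=-1).mean()
    return lambda_tv * loss_tv + lambda_vv * loss_vv


def vlms(text_emb: torch.Tensor, fake_img_emb: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the Vision-Language Matching Score:
    mean text-image cosine similarity under a pretrained matching model,
    so it rewards semantic consistency rather than realism alone."""
    text_emb = F.normalize(text_emb, dim=-1)
    fake_img_emb = F.normalize(fake_img_emb, dim=-1)
    return F.cosine_similarity(fake_img_emb, text_emb, dim=-1).mean()
```

In a sketch like this, the loss would be added to the usual adversarial objective of the generator, which is consistent with the abstract's claim that the strategy can be bolted onto existing baselines such as AttnGAN and DFGAN.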
Pages: 7062-7075
Number of Pages: 14