Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Times Cited: 6
Authors
Cheng, Qingrong [1 ]
Wen, Keyu [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Image synthesis; Visualization; Task analysis; Measurement; Generative adversarial networks; Image quality; text-to-image synthesis; vision-language matching; ATTENTION; GAN;
DOI
10.1109/TMM.2022.3217384
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Text-to-image synthesis is an attractive but challenging task that aims to generate a photo-realistic and semantically consistent image from a given text description. Images synthesized by off-the-shelf models usually contain fewer components than the corresponding real image and text description, which degrades both image quality and textual-visual consistency. To address this issue, we propose a novel vision-language matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The dual mechanism considers both textual-visual matching between the generated image and the corresponding text description and a visual-visual consistency constraint between the synthesized image and the real image. Given a text description, VLMGAN* first encodes it into textual features and then feeds them to a dual vision-language matching-based generative model to synthesize a photo-realistic and semantically consistent image. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly assess the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS), which evaluates both the image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement it on two popular baselines, denoted VLMGAN+AttnGAN and VLMGAN+DFGAN. Experimental results on two widely used datasets show that both models achieve significant improvements over other state-of-the-art methods.
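To make the dual matching mechanism concrete, below is a minimal PyTorch sketch of how such a training objective and the VLMS metric could look. The abstract does not give the paper's exact formulation, so everything here is an assumption for illustration: the function names dual_vision_language_matching_loss and vlms, the weights lambda_tv and lambda_vv, and the cosine-similarity form of the matching terms are all hypothetical, not VLMGAN*'s actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only. Assumes a text encoder and an image encoder
# that map inputs into a shared D-dimensional embedding space; the
# paper's actual loss and metric definitions may differ.

def dual_vision_language_matching_loss(
    text_emb: torch.Tensor,      # (B, D) features of the text description
    fake_img_emb: torch.Tensor,  # (B, D) features of the generated image
    real_img_emb: torch.Tensor,  # (B, D) features of the ground-truth image
    lambda_tv: float = 1.0,      # weight of the textual-visual term (assumed)
    lambda_vv: float = 1.0,      # weight of the visual-visual term (assumed)
) -> torch.Tensor:
    """Dual matching: pull the generated image toward both the text
    description (textual-visual) and the real image (visual-visual)."""
    text_emb = F.normalize(text_emb, dim=-1)
    fake_img_emb = F.normalize(fake_img_emb, dim=-1)
    real_img_emb = F.normalize(real_img_emb, dim=-1)

    # Textual-visual matching: generated image vs. its description.
    loss_tv = 1.0 - F.cosine_similarity(fake_img_emb, text_emb, dim=-1).mean()
    # Visual-visual matching: generated image vs. the real image.
    loss_vv = 1.0 - F.cosine_similarity(fake_img_emb, real_img_emb, dim=-1).mean()
    return lambda_tv * loss_tv + lambda_vv * loss_vv


def vlms(text_emb: torch.Tensor, fake_img_emb: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the Vision-Language Matching Score:
    mean text-image cosine similarity under a pretrained matching model,
    so it rewards semantic consistency rather than realism alone."""
    text_emb = F.normalize(text_emb, dim=-1)
    fake_img_emb = F.normalize(fake_img_emb, dim=-1)
    return F.cosine_similarity(fake_img_emb, text_emb, dim=-1).mean()
```

In a sketch like this, the loss would be added to the usual adversarial objective of the generator, which is consistent with the abstract's claim that the strategy can be bolted onto existing baselines such as AttnGAN and DFGAN.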
Pages: 7062-7075
Number of Pages: 14