MFECLIP: CLIP With Mapping-Fusion Embedding for Text-Guided Image Editing

Cited by: 6
Authors
Wu, Fei [1 ,2 ]
Ma, Yongheng [1 ,2 ]
Jin, Hao [1 ,2 ]
Jing, Xiao-Yuan [3 ]
Jiang, Guo-Ping [1 ,2 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Automat, Nanjing 210049, Peoples R China
[2] Nanjing Univ Posts & Telecommun, Coll Artificial Intelligence, Nanjing 210049, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
Keywords
Semantics; Generative adversarial networks; Training; Task analysis; Flowering plants; Birds; Telecommunications; Text-guided image editing; GAN; CLIP
DOI
10.1109/LSP.2023.3342649
CLC Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Recently, generative adversarial networks (GANs) have made remarkable progress, particularly with the advent of Contrastive Language-Image Pretraining (CLIP), which maps images and text into a joint latent space, bridging the gap between the two modalities. Several impressive text-guided image editing methods based on GANs and CLIP have emerged. However, most of these studies simply minimize the distance between the target image embedding and the text embedding in the CLIP space and take this objective as the network's optimization goal, overlooking that the actual distance between the two may remain large. This can prevent the text prompt from accurately guiding the editing process and can cause unwanted changes in text-irrelevant attributes. To mitigate this issue, we propose a novel approach named CLIP with Mapping-Fusion Embedding (MFECLIP) for text-guided image editing, which comprises two components: the MFE Block and the MFE Loss. The MFE Block produces a Mapping-Fusion Embedding (MFE) that further narrows the modality gap and serves as a better guide for the editing process than the original text embedding. The MFE Loss, built on contrastive learning, is designed to achieve accurate alignment between the target image and the text prompt. Extensive experiments on the real datasets CUB and Oxford demonstrate the favorable performance of the proposed method.
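The abstract describes the two components only at a high level. For illustration, below is a minimal PyTorch sketch of what an MFE-style block and a contrastive MFE-style loss could look like. Everything here is an assumption rather than the letter's actual implementation: the MFEBlock and mfe_loss names, the 512-dimensional CLIP embedding width, the two-layer mapping MLP, the learnable convex fusion weight alpha, and the symmetric InfoNCE formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MFEBlock(nn.Module):
    """Hypothetical sketch of a Mapping-Fusion Embedding block.
    Layer sizes and the fusion rule are assumptions; the letter's
    abstract does not specify them."""

    def __init__(self, dim=512):  # 512 assumed (CLIP ViT-B/32 width)
        super().__init__()
        # Mapping network: pushes the text embedding toward the image
        # side of the CLIP space to shrink the modality gap.
        self.mapper = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Learnable weight for fusing mapped-text and image cues.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, text_emb, src_img_emb):
        mapped = self.mapper(text_emb)
        # Convex fusion of the mapped text embedding and the source
        # image embedding, re-normalized onto the CLIP hypersphere.
        mfe = self.alpha * mapped + (1.0 - self.alpha) * src_img_emb
        return F.normalize(mfe, dim=-1)

def mfe_loss(edited_img_emb, mfe, temperature=0.07):
    """InfoNCE-style contrastive loss (assumed formulation): each
    edited-image embedding is pulled toward its own MFE and pushed
    away from the other MFEs in the batch."""
    edited_img_emb = F.normalize(edited_img_emb, dim=-1)
    logits = edited_img_emb @ mfe.t() / temperature  # (B, B) cosine logits
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric over image-to-embedding and embedding-to-image directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

In training, the generator would then be optimized with mfe_loss applied to the CLIP embedding of the edited image and the output of MFEBlock for the corresponding text prompt and source image; this wiring, like the block itself, is inferred from the abstract rather than taken from the paper.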
Pages: 116-120
Page count: 5