LisaCLIP: Locally Incremental Semantics Adaptation towards Zero-shot Text-driven Image Synthesis

Cited by: 2
Authors
Cao, An [1]
Zhou, Yilin [1]
Shen, Gang [1]
Affiliation
[1] Huazhong Univ Sci & Technol, Sch Software Engn, Wuhan, Peoples R China
Source
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023
Keywords
image synthesis; style transfer; CLIP model; adaptive patch selection
DOI
10.1109/IJCNN54540.2023.10191516
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The automatic transfer of a plain photo into a desired synthetic style has attracted numerous users in photo editing, visual art, and entertainment. By connecting images and texts, the Contrastive Language-Image Pre-Training (CLIP) model facilitates text-driven style transfer without exploring the image's latent domain. However, the trade-off between content fidelity and stylization remains challenging. In this paper, we present LisaCLIP, a CLIP-based image synthesis framework that exploits only the CLIP model to guide image manipulation with a depth-adaptive encoder-decoder network. Since an image patch's semantics depend on its size, LisaCLIP progressively downsizes the patches while adaptively selecting the most significant ones for further stylization. We introduce a multi-stage training strategy that speeds up LisaCLIP's convergence by decoupling the optimization objectives. Experiments on public datasets demonstrate that LisaCLIP supports a wide range of style transfer tasks and outperforms other state-of-the-art methods in balancing content and style.
Pages: 10