LisaCLIP: Locally Incremental Semantics Adaptation towards Zero-shot Text-driven Image Synthesis

Cited by: 2

Authors:

Cao, An [1]

Zhou, Yilin [1]

Shen, Gang [1]

Affiliations:

[1] Huazhong University of Science and Technology, School of Software Engineering, Wuhan, People's Republic of China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Keywords:

image synthesis; style transfer; CLIP model; adaptive patch selection

DOI:

10.1109/IJCNN54540.2023.10191516

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The automatic transfer of a plain photo into a desired synthetic style has attracted numerous users in the application fields of photo editing, visual art, and entertainment. By connecting images and texts, the Contrastive Language-Image Pre-Training (CLIP) model facilitates the text-driven style transfer without exploring the image's latent domain. However, the trade-off between content fidelity and stylization remains challenging. In this paper, we present LisaCLIP, a CLIP-based image synthesis framework that only exploits the CLIP model to guide the imagery manipulations with a depth-adaptive encoder-decoder network. Since an image patch's semantics depend on its size, LisaCLIP progressively downsizes the patches while adaptively selecting the most significant ones for further stylization. We introduce a multi-stage training strategy to speed up LisaCLIP's convergence by decoupling the optimization objectives. Various experiments on public datasets demonstrated that LisaCLIP supported a wide range of style transfer tasks and outperformed other state-of-the-art methods in maintaining the balance between content and style.

References

Pages: 10

26 entries in total

[21] Sanghi Aditya, 2022, P IEEECVF C COMPUTER, P18603

[22] Interpreting the Latent Space of GANs for Semantic Face Editing [J]. Shen, Yujun; Gu, Jinjin; Tang, Xiaoou; Zhou, Bolei. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020: 9240-9249

[23] Song Y., 2022, CLIPVG TEXT GUIDED I, DOI 10.48550/arXiv.2212.02122

[24] ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [J]. Tewel, Yoad; Shalev, Yoav; Schwartz, Idan; Wolf, Lior. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 17897-17907

[25] CCPL: Contrastive Coherence Preserving Loss for Versatile Style Transfer [J]. Wu, Zijie; Zhu, Zhen; Du, Junping; Bai, Xiang. COMPUTER VISION - ECCV 2022, PT XVI, 2022, 13676: 189-206

[26] Yang Z., 2022, ABS220711598 ARXIV
