An Efficient Transformer Based on Global and Local Self-Attention for Face Photo-Sketch Synthesis

Cited by: 13

Authors:

Yu, Wangbo [1]

Zhu, Mingrui [1]

Wang, Nannan [1]

Wang, Xiaoyu [2]

Gao, Xinbo [3]

Affiliations:

[1] Xidian Univ, Sch Telecommun Engn, State Key Lab Integrated Serv Networks, Xian 710071, Shaanxi, Peoples R China

[2] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China

[3] Chongqing Univ Posts & Telecommun, Chongqing Key Lab Image Cognit, Chongqing 400065, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2023 / Volume 32

Funding:

National Natural Science Foundation of China;

Keywords:

Face photo-sketch synthesis; transformer; global self-attention; local self-attention; generative adversarial networks (GANs);

DOI:

10.1109/TIP.2022.3229614

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Face photo-sketch synthesis tasks have been dominated by convolutional neural networks (CNNs), especially CNN-based generative adversarial networks (GANs), because their strong texture modeling capabilities let them generate more realistic face photos/sketches than traditional methods. However, due to CNNs' locality and spatial invariance properties, they are weak at capturing global and structural information, which is extremely important for face images. Inspired by the recent success of the Transformer in vision tasks, we propose replacing CNNs with Transformers, which can model long-range dependencies, to synthesize more structured and realistic face images. However, existing vision Transformers are mainly designed for high-level vision tasks and lack the dense prediction ability needed to generate high-resolution images, because of the quadratic computational complexity of their self-attention mechanism. In addition, the original Transformer is not capable of modeling local correlations, which is an important ability for image generation. To address these challenges, we propose two types of memory-friendly Transformer encoders, one for processing local correlations via local self-attention and another for modeling global information via global self-attention. By integrating the two proposed Transformer encoders, we present an efficient GL-Transformer for face photo-sketch synthesis, which can synthesize realistic face photo/sketch images from coarse to fine. Extensive experiments demonstrate that our model achieves performance comparable to or better than state-of-the-art CNN-based methods, both qualitatively and quantitatively.
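The quadratic cost mentioned in the abstract can be illustrated with a small counting sketch. It compares how many pairwise attention scores are computed when every token attends to every other token versus when attention is restricted to non-overlapping square windows. The window partition, the function names, and the 64 x 64 token grid are assumptions made for illustration only; the record does not describe how the paper's local and global encoders are actually built.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Number of attention scores when all tokens attend to all tokens."""
    n = height * width          # total tokens in the feature map
    return n * n                # quadratic in the number of tokens


def local_attention_pairs(height: int, width: int, window: int) -> int:
    """Number of attention scores with non-overlapping window x window blocks.

    Assumption for illustration: attention is computed independently inside
    each block, so tokens in different blocks never attend to each other.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    h = w = 64                  # hypothetical token grid size
    full = global_attention_pairs(h, w)
    windowed = local_attention_pairs(h, w, window=8)
    print(f"global: {full} scores, local (8x8 windows): {windowed} scores")
    print(f"reduction factor: {full // windowed}")
```

Under these assumptions the windowed variant grows linearly with the number of tokens for a fixed window size, which is one common way to make self-attention affordable at high resolution.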

Citation

Pages: 483-495

Page count: 13
