Memorizing Swin-Transformer Denoising Network for Diffusion Model

Cited by: 2
Authors
Chen, Jindou [1 ]
Shen, Yiqing [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Moe Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
Keywords
diffusion models; denoising network; Swin-Transformer; memorizing attention mechanism
DOI
10.3390/electronics13204050
CLC Number
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Diffusion models have garnered significant attention in the field of image generation. However, existing denoising architectures face limitations: U-Net struggles to capture global context, while Vision Transformers (ViTs) may struggle to model fine-grained local structure. To address these challenges, we propose a novel Swin-Transformer-based denoising network architecture that leverages the strengths of both U-Net and ViT. Moreover, our approach integrates a k-Nearest Neighbor (kNN) based memorizing attention module into the Swin-Transformer, enabling it to retrieve crucial contextual information from feature maps and enhance its representational capacity. Finally, we introduce a hierarchical time stream embedding scheme that optimizes the incorporation of temporal cues during the denoising process; this method surpasses basic approaches such as simple addition or concatenation of fixed time embeddings, enabling a more effective fusion of temporal information. Extensive experiments on four benchmark datasets demonstrate the superior performance of the proposed model compared to U-Net and ViT denoising networks. Our model outperforms the baselines on the CRC-VAL-HE-7K and CelebA datasets, achieving improved FID scores of 14.39 and 4.96, respectively, and even surpasses DiT and UViT under our experimental setting. The Memorizing Swin-Transformer architecture, coupled with the hierarchical time stream embedding, sets a new state of the art in denoising diffusion models for image generation.
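The abstract describes, but does not implement, the kNN-based memorizing attention that augments the Swin-Transformer denoising blocks. The sketch below is a minimal illustration of such a layer, not the authors' code: the class name MemorizingAttention, the parameters memory_size and top_k, and the learned gate mixing local and retrieved attention are all assumptions, and the step that writes new keys/values into the memory (as well as the time embedding injection) is omitted for brevity.

```python
# Hypothetical sketch of kNN-based memorizing attention for a Swin-style block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorizingAttention(nn.Module):
    """Local self-attention augmented with kNN retrieval from a key/value memory."""
    def __init__(self, dim, num_heads=4, memory_size=2048, top_k=32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head gate mixing local attention with memory attention.
        self.gate = nn.Parameter(torch.zeros(num_heads))
        # Non-learned buffers holding previously seen keys/values (one slab per head).
        self.register_buffer("mem_k", torch.zeros(num_heads, memory_size, self.head_dim))
        self.register_buffer("mem_v", torch.zeros(num_heads, memory_size, self.head_dim))

    def forward(self, x):  # x: (B, N, dim) tokens of one window / flattened feature map
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)  # (B, H, N, d)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Standard local attention within the window.
        local = F.scaled_dot_product_attention(q, k, v)                  # (B, H, N, d)

        # kNN retrieval: attend only over the top-k most similar memory entries.
        sim = torch.einsum("bhnd,hmd->bhnm", q, self.mem_k)              # (B, H, N, M)
        topv, topi = sim.topk(self.top_k, dim=-1)                        # (B, H, N, k)
        attn = topv.softmax(dim=-1)
        heads = torch.arange(self.num_heads, device=x.device).view(1, -1, 1, 1)
        gathered = self.mem_v[heads, topi]                               # (B, H, N, k, d)
        memory = (attn.unsqueeze(-1) * gathered).sum(dim=3)              # (B, H, N, d)

        # Gated combination of local and retrieved context.
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        out = g * memory + (1 - g) * local
        # NOTE: writing the current keys/values into mem_k/mem_v is omitted here.
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))
```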
Pages: 12