All are Worth Words: A ViT Backbone for Diffusion Models

Cited by: 79
Authors
Bao, Fan [1 ]
Nie, Shen [2 ,3 ]
Xue, Kaiwen [2 ,3 ]
Cao, Yue [4 ]
Li, Chongxuan [2 ,3 ]
Su, Hang
Zhu, Jun [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Dept Comp Sci & Tech, BNRist Ctr, Beijing, Peoples R China
[2] Renmin Univ China, GaoLing Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[4] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Keywords
DOI
10.1109/CVPR52729.2023.02171
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT on unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable, if not superior, to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256×256 and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial, while the down-sampling and up-sampling operators in the CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
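A minimal sketch of the design described above, assuming a PyTorch implementation: every input (diffusion time step, class condition, noisy image patches) is embedded as a token, and each block in the deep half receives a long skip connection from a block in the shallow half via concatenation followed by a linear projection. Class and function names, layer sizes, and the simple linear time embedding are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn


def _block(dim, num_heads):
    # Pre-norm transformer encoder block used as a stand-in for the paper's
    # transformer blocks (illustrative choice).
    return nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                      batch_first=True, norm_first=True)


class UViTSketch(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_chans=3,
                 dim=256, depth=8, num_heads=4, num_classes=10):
        super().__init__()
        assert depth % 2 == 0, "depth is split into a shallow and a deep half"
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Tokenizers: image patches, diffusion time step, and class condition.
        self.patch_embed = nn.Conv2d(in_chans, dim, patch_size, stride=patch_size)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.label_embed = nn.Embedding(num_classes, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))

        half = depth // 2
        self.in_blocks = nn.ModuleList(_block(dim, num_heads) for _ in range(half))
        self.out_blocks = nn.ModuleList(_block(dim, num_heads) for _ in range(half))
        # Long skips: concatenate shallow and deep features, project back to dim.
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(half))

        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, patch_size * patch_size * in_chans)

    def forward(self, x, t, y):
        B, C, H, W = x.shape
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        t_tok = self.time_embed(t.float().view(B, 1)).unsqueeze(1)      # (B, 1, dim)
        y_tok = self.label_embed(y).unsqueeze(1)                        # (B, 1, dim)
        h = torch.cat([t_tok, y_tok, patches], dim=1) + self.pos_embed  # all inputs as tokens

        skips = []
        for blk in self.in_blocks:                                      # shallow half
            h = blk(h)
            skips.append(h)
        for blk, proj in zip(self.out_blocks, self.skip_proj):          # deep half
            h = proj(torch.cat([h, skips.pop()], dim=-1))               # long skip connection
            h = blk(h)

        h = self.norm(h)[:, 2:]                                         # drop time/label tokens
        out = self.head(h)                                              # per-patch noise prediction
        # Un-patchify the token outputs back to an image-shaped tensor.
        P, N = self.patch_size, H // self.patch_size
        out = out.view(B, N, N, P, P, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out


if __name__ == "__main__":
    model = UViTSketch()
    x = torch.randn(2, 3, 32, 32)                 # noisy images
    t = torch.randint(0, 1000, (2,))              # diffusion time steps
    y = torch.randint(0, 10, (2,))                # class labels
    print(model(x, t, y).shape)                   # torch.Size([2, 3, 32, 32])

The long skip connections are the part the abstract singles out as crucial; the down-sampling and up-sampling of a CNN-based U-Net are deliberately absent, since the paper finds they are not always necessary.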
Pages: 22669-22679
Number of pages: 11