All are Worth Words: A ViT Backbone for Diffusion Models

Cited by: 79
Authors
Bao, Fan [1 ]
Nie, Shen [2 ,3 ]
Xue, Kaiwen [2 ,3 ]
Cao, Yue [4 ]
Li, Chongxuan [2 ,3 ]
Su, Hang
Zhu, Jun [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Dept Comp Sci & Tech, BNRist Ctr, Beijing, Peoples R China
[2] Renmin Univ China, GaoLing Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[4] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Keywords
DOI
10.1109/CVPR52729.2023.02171
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT on unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable, if not superior, to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256×256 and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial, while the down-sampling and up-sampling operators in the CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
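A minimal sketch of the design described above, assuming a PyTorch implementation: every input (diffusion time step, class condition, noisy image patches) is embedded as a token, and each block in the deep half receives a long skip connection from a block in the shallow half via concatenation followed by a linear projection. Class and function names, layer sizes, and the simple linear time embedding are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn


def _block(dim, num_heads):
    # Pre-norm transformer encoder block used as a stand-in for the paper's
    # transformer blocks (illustrative choice).
    return nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                      batch_first=True, norm_first=True)


class UViTSketch(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_chans=3,
                 dim=256, depth=8, num_heads=4, num_classes=10):
        super().__init__()
        assert depth % 2 == 0, "depth is split into a shallow and a deep half"
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Tokenizers: image patches, diffusion time step, and class condition.
        self.patch_embed = nn.Conv2d(in_chans, dim, patch_size, stride=patch_size)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.label_embed = nn.Embedding(num_classes, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))

        half = depth // 2
        self.in_blocks = nn.ModuleList(_block(dim, num_heads) for _ in range(half))
        self.out_blocks = nn.ModuleList(_block(dim, num_heads) for _ in range(half))
        # Long skips: concatenate shallow and deep features, project back to dim.
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(half))

        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, patch_size * patch_size * in_chans)

    def forward(self, x, t, y):
        B, C, H, W = x.shape
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        t_tok = self.time_embed(t.float().view(B, 1)).unsqueeze(1)      # (B, 1, dim)
        y_tok = self.label_embed(y).unsqueeze(1)                        # (B, 1, dim)
        h = torch.cat([t_tok, y_tok, patches], dim=1) + self.pos_embed  # all inputs as tokens

        skips = []
        for blk in self.in_blocks:                                      # shallow half
            h = blk(h)
            skips.append(h)
        for blk, proj in zip(self.out_blocks, self.skip_proj):          # deep half
            h = proj(torch.cat([h, skips.pop()], dim=-1))               # long skip connection
            h = blk(h)

        h = self.norm(h)[:, 2:]                                         # drop time/label tokens
        out = self.head(h)                                              # per-patch noise prediction
        # Un-patchify the token outputs back to an image-shaped tensor.
        P, N = self.patch_size, H // self.patch_size
        out = out.view(B, N, N, P, P, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out


if __name__ == "__main__":
    model = UViTSketch()
    x = torch.randn(2, 3, 32, 32)                 # noisy images
    t = torch.randint(0, 1000, (2,))              # diffusion time steps
    y = torch.randint(0, 10, (2,))                # class labels
    print(model(x, t, y).shape)                   # torch.Size([2, 3, 32, 32])

The long skip connections are the part the abstract singles out as crucial; the down-sampling and up-sampling of a CNN-based U-Net are deliberately absent, since the paper finds they are not always necessary.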
Pages: 22669-22679
Number of pages: 11