Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Cited by: 30
Authors
Bond-Taylor, Sam [1 ]
Hessey, Peter [1 ]
Sasaki, Hiroshi [1 ]
Breckon, Toby P. [1 ,2 ]
Willcocks, Chris G. [1 ]
Affiliations
[1] Univ Durham, Dept Comp Sci, Durham, England
[2] Univ Durham, Dept Engn, Durham, England
Source
COMPUTER VISION, ECCV 2022, PT XXIII | 2022, Vol. 13683
Keywords
Generative model; Diffusion; High-resolution image synthesis; Attention
DOI
10.1007/978-3-031-20050-2_11
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Whilst diffusion probabilistic models can generate high quality image content, key limitations remain in terms of both generating high-resolution imagery and their associated high computational requirements. Recent Vector-Quantized image models have overcome this limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent high-resolution and diverse imagery at a fraction of the computational expense. In this manner, we can generate image resolutions exceeding that of the original training set samples whilst additionally provisioning per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of the manifold overlap metrics Coverage (LSUN Bedroom: 0.83; LSUN Churches: 0.73; FFHQ: 0.80) and Density (LSUN Bedroom: 1.51; LSUN Churches: 1.12; FFHQ: 1.20), and performs competitively on FID (LSUN Bedroom: 3.27; LSUN Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both computation and reduced training set requirements.
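The absorbing-diffusion training objective described in the abstract can be illustrated with a minimal sketch. Assumptions (not taken from the paper's released code): PyTorch, a placeholder `denoiser` standing in for any Transformer that maps a sequence of token ids to per-position logits over the codebook, and illustrative names `vocab_size` and `mask_id` for the codebook size and the dedicated absorbing [MASK] token.

import torch
import torch.nn.functional as F

def absorbing_diffusion_step(denoiser, x0, vocab_size, mask_id):
    """One training step: mask a random fraction of Vector-Quantized tokens
    in an order-agnostic manner and train the Transformer to predict them back.

    x0: LongTensor of shape (batch, seq_len) holding VQ token indices.
    """
    b, n = x0.shape
    # Sample a timestep t ~ Uniform{1..n} per image; mask each token with probability t/n.
    t = torch.randint(1, n + 1, (b, 1), device=x0.device)
    mask = torch.rand(b, n, device=x0.device) < (t.float() / n)
    # Absorbing state: masked positions are replaced by the [MASK] id.
    xt = torch.where(mask, torch.full_like(x0, mask_id), x0)
    # The Transformer predicts logits over the codebook for every position in parallel.
    logits = denoiser(xt)  # (batch, seq_len, vocab_size)
    # The loss is only evaluated at the masked (absorbed) positions.
    loss = F.cross_entropy(logits[mask].view(-1, vocab_size), x0[mask].view(-1))
    return loss

At sampling time the process runs in reverse: starting from an all-[MASK] sequence, the Transformer fills in several token positions per step in parallel, which is the source of the speed-up over element-wise autoregressive sampling of the prior noted in the abstract.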
Pages: 170-188
Number of pages: 19