A Survey of Multimodal Controllable Diffusion Models

Cited by: 3
Authors
Jiang, Rui [1 ]
Zheng, Guang-Cong [1 ]
Li, Teng [1 ]
Yang, Tian-Rui [2 ]
Wang, Jing-Dong [3 ]
Li, Xi [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310007, Peoples R China
[2] Nanjing Univ, Dept Math, Nanjing 210023, Peoples R China
[3] Baidu Inc, Baidu Visual Technol Dept, Beijing 100085, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
diffusion model; controllable generation; application; personalization
DOI
10.1007/s11390-024-3814-0
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Diffusion models have recently emerged as powerful generative models, producing high-fidelity samples across domains. Despite this, they face two key challenges: speeding up the time-consuming iterative generation process, and controlling and steering that process. Existing surveys provide broad overviews of advances in diffusion models, but they lack comprehensive coverage centered specifically on techniques for controllable generation. This survey seeks to address that gap by providing a comprehensive and coherent review of controllable generation in diffusion models. We provide a detailed taxonomy defining controlled generation for diffusion models, in which controllable generation is categorized by formulation, methodology, and evaluation metrics. By enumerating the range of methods researchers have developed for enhanced control, we aim to establish controllable diffusion generation as a distinct subfield warranting dedicated focus. With this survey, we contextualize recent results, provide a dedicated treatment of controllable diffusion model generation, and outline limitations and future directions. To demonstrate applicability, we highlight controllable diffusion techniques applied to major computer vision tasks. By consolidating methods and applications for controllable diffusion models, we hope to catalyze further innovations in reliable and scalable controllable generation.
Pages: 509-541
Page count: 33