A Comprehensive Pipeline for Complex Text-to-Image Synthesis

Cited by: 11
Authors
Fang, Fei [1]
Luo, Fei [1]
Zhang, Hong-Pan [1]
Zhou, Hua-Jian [1]
Chow, Alix L. H. [2]
Xiao, Chun-Xia [1]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
[2] Xiaomi Technol Co LTD, Beijing 100085, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
image synthesis; scene generation; text-to-image conversion; Markov Chain Monte Carlo (MCMC);
DOI
10.1007/s11390-020-0305-9
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Synthesizing a complex scene image with multiple objects and a background according to a text description is a challenging problem. It requires solving several difficult tasks across the fields of natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and optimization of the objects' status. To reach a satisfactory result, we propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground object and background scene retrieval, image synthesis using constrained MCMC, and post-processing. Firstly, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Secondly, we retrieve the required foreground objects from the foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from the background image dataset extracted from the Internet. Thirdly, to ensure that the foreground objects' positions and sizes are reasonable in the image synthesis step, we design a cost function and use the Markov Chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we further use Poisson-based and relighting-based methods to blend the foreground objects and the background scene image in the post-processing step. The synthesized results and comparison results based on the Microsoft COCO dataset show that our method outperforms some of the state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of generated scene images.
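
The abstract names a cost function optimized with MCMC for the constrained layout step but does not give its form. The following is a minimal sketch of that idea only, not the authors' method: it assumes a hypothetical cost made of pairwise overlap, area falling outside the canvas, and a penalty for scaling an object away from its retrieved size, and it uses a plain Metropolis acceptance rule with symmetric Gaussian proposals. All function names, weights, step sizes, and the temperature are illustrative assumptions.

```python
import math
import random


def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return max(dx, 0.0) * max(dy, 0.0)


def to_box(state, base_size):
    """Turn an (x, y, scale) state plus a base (w, h) into an (x, y, w, h) box."""
    x, y, s = state
    w, h = base_size
    return (x, y, w * s, h * s)


def layout_cost(states, base_sizes, canvas, size_weight=1000.0):
    """Hypothetical cost (an assumption, not the paper's): lower is better."""
    canvas_box = (0.0, 0.0, canvas[0], canvas[1])
    boxes = [to_box(s, b) for s, b in zip(states, base_sizes)]
    cost = 0.0
    for i, box in enumerate(boxes):
        # Area of this object that falls outside the canvas.
        cost += box[2] * box[3] - overlap_area(box, canvas_box)
        # Penalty for resizing the object away from its original size.
        cost += size_weight * (states[i][2] - 1.0) ** 2
        # Overlap with every later object (each pair counted once).
        for other in boxes[i + 1:]:
            cost += overlap_area(box, other)
    return cost


def mcmc_layout(states, base_sizes, canvas, steps=2000, temperature=20.0, seed=0):
    """Metropolis search over object positions and scales; returns the best layout seen."""
    rng = random.Random(seed)
    current = [tuple(s) for s in states]
    current_cost = layout_cost(current, base_sizes, canvas)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        i = rng.randrange(len(current))
        x, y, s = current[i]
        # Symmetric Gaussian proposal on position and scale of one object.
        candidate = (x + rng.gauss(0.0, 5.0), y + rng.gauss(0.0, 5.0), s + rng.gauss(0.0, 0.02))
        if candidate[2] <= 0.1:
            continue  # hard constraint: reject degenerate scales outright
        proposal = list(current)
        proposal[i] = candidate
        proposal_cost = layout_cost(proposal, base_sizes, canvas)
        delta = proposal_cost - current_cost
        # Always accept improvements; accept worse layouts with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


if __name__ == "__main__":
    start = [(0.0, 0.0, 1.0), (5.0, 5.0, 1.0)]   # two heavily overlapping objects
    sizes = [(20.0, 20.0), (20.0, 20.0)]
    layout, cost = mcmc_layout(start, sizes, (100.0, 100.0))
    print(layout, cost)
```

Because the sketch tracks the best layout seen so far, the returned cost never exceeds the cost of the starting layout. A real system would presumably add semantic constraints parsed from the text (for example, relative positions between objects), which this sketch omits.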
Pages: 522-537
Page count: 16