Scripted Video Generation With a Bottom-Up Generative Adversarial Network

Times Cited: 14
Authors
Chen, Qi [1 ,2 ]
Wu, Qi [3 ]
Chen, Jian [1 ]
Wu, Qingyao [1 ]
van den Hengel, Anton [3 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510640, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
Funding
National Natural Science Foundation of China;
Keywords
Generative adversarial networks; video generation; semantic alignment; temporal coherence;
DOI
10.1109/TIP.2020.3003227
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generating videos given a text description (such as a script) is non-trivial due to the intrinsic complexity of image frames and the structure of videos. Although Generative Adversarial Networks (GANs) have been successfully applied to generate images conditioned on a natural language description, it remains very challenging to generate realistic videos, whose frames must exhibit both spatial and temporal coherence. In this paper, we propose a novel Bottom-up GAN (BoGAN) method for generating videos given a text description. To ensure coherence across the generated frames and to make the whole video match the language description semantically, we design a bottom-up optimisation mechanism to train BoGAN. Specifically, we devise a region-level loss via an attention mechanism to preserve local semantic alignment and to draw details in different sub-regions of the video conditioned on the words most relevant to them. Moreover, to guarantee the matching between the text and each frame, we introduce a frame-level discriminator, which also maintains the fidelity of each frame and the coherence across frames. Last, to ensure global semantic alignment between the whole video and the given text, we apply a video-level discriminator. We evaluate the effectiveness of the proposed BoGAN on two synthetic datasets (i.e., SBMG and TBMG) and two real-world datasets (i.e., MSVD and KTH).
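
The bottom-up objective described in the abstract combines three adversarial terms at increasing scope. Below is a minimal PyTorch sketch, not the authors' released code, of how region-, frame-, and video-level losses could be wired together; all module names, network shapes, the DAMSM-style attention formulation of the region term, and the loss weights are illustrative assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameDiscriminator(nn.Module):
        """Scores each frame for realism and text-frame matching (illustrative)."""
        def __init__(self, text_dim=256):
            super().__init__()
            self.cnn = nn.Sequential(                      # per-frame 2D encoder
                nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1),
            )
            self.score = nn.Linear(128 + text_dim, 1)

        def forward(self, frames, text):                   # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)        # (B*T, 128)
            text_rep = text.unsqueeze(1).expand(b, t, -1).flatten(0, 1)
            return self.score(torch.cat([feats, text_rep], dim=1))   # (B*T, 1)

    class VideoDiscriminator(nn.Module):
        """Scores the whole clip for global text-video alignment (illustrative)."""
        def __init__(self, text_dim=256):
            super().__init__()
            self.cnn3d = nn.Sequential(                    # spatio-temporal encoder
                nn.Conv3d(3, 64, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
                nn.Conv3d(64, 128, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool3d(1),
            )
            self.score = nn.Linear(128 + text_dim, 1)

        def forward(self, video, text):                    # video: (B, 3, T, H, W)
            feats = self.cnn3d(video).flatten(1)                      # (B, 128)
            return self.score(torch.cat([feats, text], dim=1))        # (B, 1)

    def region_word_loss(region_feats, word_feats):
        """DAMSM-style word-region attention penalty, an assumed stand-in for
        the paper's region-level loss.

        region_feats: (B, R, D) features of R sub-regions of the generated frames
        word_feats:   (B, W, D) word embeddings of the script
        """
        attn = torch.softmax(
            torch.bmm(word_feats, region_feats.transpose(1, 2)), dim=-1)  # (B, W, R)
        attended = torch.bmm(attn, region_feats)                          # (B, W, D)
        sim = F.cosine_similarity(attended, word_feats, dim=-1)           # (B, W)
        return (1.0 - sim).mean()  # reward words that find a matching region

    def generator_loss(d_frame, d_video, fake_frames, text, region_feats,
                       word_feats, w_region=1.0, w_frame=1.0, w_video=1.0):
        """Bottom-up generator objective: region + frame + video terms
        (the relative weights here are assumptions)."""
        frame_logits = d_frame(fake_frames, text)
        video_logits = d_video(fake_frames.permute(0, 2, 1, 3, 4), text)
        l_frame = F.binary_cross_entropy_with_logits(
            frame_logits, torch.ones_like(frame_logits))
        l_video = F.binary_cross_entropy_with_logits(
            video_logits, torch.ones_like(video_logits))
        l_region = region_word_loss(region_feats, word_feats)
        return w_region * l_region + w_frame * l_frame + w_video * l_video

    if __name__ == "__main__":
        # Smoke test with random tensors standing in for generator outputs.
        b, t, d = 2, 8, 256
        loss = generator_loss(
            FrameDiscriminator(d), VideoDiscriminator(d),
            torch.randn(b, t, 3, 64, 64),   # fake frames from the generator
            torch.randn(b, d),              # sentence embedding of the script
            torch.randn(b, 16, d),          # sub-region features
            torch.randn(b, 5, d))           # word embeddings
        print(loss.item())

In this sketch the frame discriminator scores every frame jointly with the sentence embedding, while the 3D-convolutional video discriminator scores the clip as a whole; the paper's actual network architectures and region-level formulation may differ.
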
Pages: 7454-7467
Number of Pages: 14