Scripted Video Generation With a Bottom-Up Generative Adversarial Network

Cited by: 14
Authors
Chen, Qi [1 ,2 ]
Wu, Qi [3 ]
Chen, Jian [1 ]
Wu, Qingyao [1 ]
van den Hengel, Anton [3 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510640, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
Funding
National Natural Science Foundation of China;
Keywords
Generative adversarial networks; video generation; semantic alignment; temporal coherence;
DOI
10.1109/TIP.2020.3003227
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generating videos given a text description (such as a script) is non-trivial due to the intrinsic complexity of image frames and the temporal structure of videos. Although Generative Adversarial Networks (GANs) have been successfully applied to generate images conditioned on a natural language description, it remains very challenging to generate realistic videos, whose frames must exhibit both spatial and temporal coherence. In this paper, we propose a novel Bottom-up GAN (BoGAN) method for generating videos given a text description. To ensure the coherence of the generated frames and to make the whole video match the language description semantically, we design a bottom-up optimisation mechanism to train BoGAN. Specifically, we devise a region-level loss via an attention mechanism to preserve local semantic alignment and to draw details in different sub-regions of the video conditioned on the words most relevant to them. Moreover, to guarantee the match between the text and each frame, we introduce a frame-level discriminator, which also maintains the fidelity of each frame and the coherence across frames. Finally, to ensure global semantic alignment between the whole video and the given text, we apply a video-level discriminator. We evaluate the effectiveness of the proposed BoGAN on two synthetic datasets (i.e., SBMG and TBMG) and two real-world datasets (i.e., MSVD and KTH).
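The abstract describes a generator objective built from three levels: a region-level attention loss, a frame-level adversarial loss, and a video-level adversarial loss. The following is a minimal illustrative sketch of how such a bottom-up combination could look; the function names, the cosine-similarity form of the region loss, and the loss weights are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def region_level_loss(region_feats, word_embs):
    """Attention-style alignment between video sub-regions and words.
    region_feats: (R, d) array; word_embs: (W, d) array.
    Each region attends to its most relevant word; the loss
    encourages high cosine similarity for that best match."""
    sims = region_feats @ word_embs.T
    norms = (np.linalg.norm(region_feats, axis=1, keepdims=True)
             * np.linalg.norm(word_embs, axis=1).reshape(1, -1))
    sims = sims / np.maximum(norms, 1e-8)
    return float(np.mean(1.0 - sims.max(axis=1)))

def adversarial_loss(d_scores):
    """Non-saturating generator loss for discriminator scores in (0, 1)."""
    return float(-np.mean(np.log(np.clip(d_scores, 1e-8, 1.0))))

def bogan_generator_loss(region_feats, word_embs,
                         frame_scores, video_score,
                         lambdas=(1.0, 1.0, 1.0)):
    """Combine the three levels bottom-up (weights are hypothetical)."""
    l_region = region_level_loss(region_feats, word_embs)
    l_frame = adversarial_loss(frame_scores)              # frame-level D
    l_video = adversarial_loss(np.array([video_score]))   # video-level D
    lr, lf, lv = lambdas
    return lr * l_region + lf * l_frame + lv * l_video
```

Perfect region-word alignment drives the region-level term to zero, while the adversarial terms vanish only when the discriminators are fully fooled, so the total loss decreases as all three levels of alignment improve together.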
Pages: 7454-7467
Page count: 14