Interactions Guided Generative Adversarial Network for unsupervised image captioning

Cited by: 23
Authors
Cao, Shan [1,2]
An, Gaoyun [1,2]
Zheng, Zhenxing [1,2]
Ruan, Qiuqi [1,2]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Unsupervised image caption; Object-object interactions; Multi-scale feature;
DOI
10.1016/j.neucom.2020.08.019
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Most current image captioning models that have achieved great success depend heavily on manually labeled image-caption pairs. However, acquiring large-scale paired data is expensive and time-consuming. In this paper, we propose the Interactions Guided Generative Adversarial Network (IGGAN) for unsupervised image captioning, which jointly exploits multi-scale feature representation and object-object interactions. To obtain a robust feature representation, the image is encoded by ResNet with a new Multi-scale module and adaptive Channel attention (RMCNet). Moreover, information on object-object interactions is extracted by our Mutual Attention Network (MAN) and then used in the adversarial generation process, which enhances the rationality of the generated sentences. To encourage each sentence to be semantically consistent with its image, IGGAN uses cycle consistency so that the image and the generated sentence reconstruct each other. Our proposed model can generate sentences without any manually labeled image-caption pairs. Experimental results show that it achieves promising performance on the MSCOCO image captioning dataset, and ablation studies validate the effectiveness of the proposed modules. (C) 2020 Elsevier B.V. All rights reserved.
Pages: 419-431
Page count: 13