Interactions Guided Generative Adversarial Network for unsupervised image captioning

Cited by: 23
Authors
Cao, Shan [1,2]
An, Gaoyun [1,2]
Zheng, Zhenxing [1,2]
Ruan, Qiuqi [1,2]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Unsupervised image caption; Object-object interactions; Multi-scale feature;
DOI
10.1016/j.neucom.2020.08.019
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Most current image captioning models that have achieved great success depend heavily on manually labeled image-caption pairs. However, acquiring large-scale paired data is expensive and time-consuming. In this paper, we propose the Interactions Guided Generative Adversarial Network (IGGAN) for unsupervised image captioning, which jointly exploits multi-scale feature representation and object-object interactions. To obtain a robust feature representation, the image is encoded by ResNet with a new Multi-scale module and adaptive Channel attention (RMCNet). Moreover, information on object-object interactions is extracted by our Mutual Attention Network (MAN) and then used in the adversarial generation process, which enhances the rationality of the generated sentences. To encourage each sentence to be semantically consistent with its image, IGGAN uses cycle consistency so that the image and the generated sentence reconstruct each other. Our proposed model can generate sentences without any manually labeled image-caption pairs. Experimental results show that it achieves promising performance on the MSCOCO image captioning dataset, and ablation studies validate the effectiveness of the proposed modules. (C) 2020 Elsevier B.V. All rights reserved.
Pages: 419-431
Page count: 13