Generative adversarial network for semi-supervised image captioning

Times Cited: 1
Authors
Liang, Xu [1 ]
Li, Chen [1 ]
Tian, Lihua [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Software Engn, Xian 71000, Peoples R China
Keywords
Transformer; Image captioning; Semi-supervised; CLIP; Generative adversarial network;
DOI
10.1016/j.cviu.2024.104199
CLC Classification
TP18 [Artificial intelligence theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional supervised image captioning methods usually rely on a large number of images and paired captions for training. However, creating such datasets requires considerable time and human effort. We therefore propose a new semi-supervised image captioning algorithm to address this problem. The proposed method uses a generative adversarial network to generate images that match captions, and treats these generated image-caption pairs as new training data. This avoids the error-accumulation problem that arises when pseudo captions are generated autoregressively, and allows the network to backpropagate directly. At the same time, to ensure correlation between the generated images and their captions, we introduce the CLIP model as a constraint. CLIP has been pre-trained on a large amount of image-text data, so it performs excellently at semantically aligning images and text. To verify the effectiveness of our method, we evaluate it on the MSCOCO offline "Karpathy" test split. Experimental results show that our method significantly improves model performance when only 1% of the paired data is used, raising the CIDEr score from 69.5% to 77.7%. This shows that our method can effectively exploit unlabeled data for image captioning tasks.
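The CLIP constraint described in the abstract amounts to maximizing the cosine similarity between the embedding of a generated image and the embedding of its caption. A minimal sketch of such an alignment loss, with toy embedding vectors standing in for real CLIP encoder outputs (all names here are illustrative, not the authors' code):

```python
import math

def cosine_similarity(u, v):
    # CLIP-style alignment score: normalized dot product of two embeddings
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_alignment_loss(image_emb, text_emb):
    # 1 - cos_sim is 0 for a perfectly aligned image-caption pair and
    # grows as the generated image drifts away from its caption,
    # so minimizing it constrains the generator toward semantic agreement.
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Toy 2-D embeddings standing in for CLIP image/text encoder outputs
aligned_loss = clip_alignment_loss([1.0, 0.0], [1.0, 0.0])      # identical pair
misaligned_loss = clip_alignment_loss([1.0, 0.0], [0.0, 1.0])   # orthogonal pair
```

In practice the embeddings would come from CLIP's pre-trained image and text encoders, and this loss would be one term in the GAN's generator objective alongside the adversarial loss.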
Pages: 10