Unpaired Image Captioning With Semantic-Constrained Self-Learning

Cited by: 31

Authors:

Ben, Huixia [1]

Pan, Yingwei [2]

Li, Yehao [2]

Yao, Ting [2]

Hong, Richang [1]

Wang, Meng [1]

Mei, Tao [2]

Affiliations:

[1] Hefei Univ Technol, Sch Comp & Informat, Hefei 230009, Peoples R China

[2] JD AI Res, CV Lab, Beijing 100105, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2022 / Vol. 24

Funding:

National Natural Science Foundation of China;

Keywords:

Semantics; Image recognition; Training; Visualization; Decoding; Task analysis; Dogs; Encoder-decoder networks; image captioning; self-supervised learning;

DOI:

10.1109/TMM.2021.3060948

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

Image captioning has been an emerging and fast-developing research topic. Nevertheless, most existing works rely heavily on large amounts of image-sentence pairs, which hinders the practical application of captioning in the wild. In this paper, we present a novel Semantic-Constrained Self-learning (SCS) framework that explores an iterative self-learning strategy to learn an image captioner with only unpaired image and text data. Technically, SCS consists of two stages, i.e., pseudo pair generation and captioner re-training, iteratively producing "pseudo" image-sentence pairs via a pre-trained captioner and re-training the captioner with the pseudo pairs, respectively. In particular, both stages are guided by the recognized objects in the image, which act as a semantic constraint to strengthen the semantic alignment between the input image and the output sentence. We leverage a semantic-constrained beam search for pseudo pair generation to regularize the decoding process with the recognized objects by forcing the inclusion/exclusion of the recognized/irrelevant objects in the output sentence. For captioner re-training, a self-supervised triplet loss is utilized to preserve the relative semantic similarity ordering among generated sentences with regard to the input image triplets. Moreover, an object inclusion reward and an adversarial reward are adopted during self-critical training to encourage the inclusion of the predicted objects in the output sentence and to pursue the generation of more realistic sentences, respectively. Experiments conducted on both dependent and independent unpaired data validate the superiority of SCS. More remarkably, we obtain the best published CIDEr score to date of 74.7% on the COCO Karpathy test split for unpaired image captioning.
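The abstract names an object inclusion reward and a triplet loss but does not give their formulas. The following is a minimal sketch of one plausible form of each, not the authors' implementation. The function names `object_inclusion_reward` and `triplet_hinge_loss`, the fraction-of-objects reward, the hinge form, and the default margin of 0.2 are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def object_inclusion_reward(sentence: Sequence[str], recognized: Iterable[str]) -> float:
    """Assumed reward: fraction of recognized objects that appear in the sentence."""
    objects = set(recognized)
    if not objects:
        # No recognized objects, so there is nothing to reward.
        return 0.0
    words = set(sentence)
    return len(objects & words) / len(objects)


def triplet_hinge_loss(sim_pos: float, sim_neg: float, margin: float = 0.2) -> float:
    """Assumed hinge form: zero once sim_pos exceeds sim_neg by at least the margin.

    sim_pos: similarity between the sentences generated for the anchor image and
             for the semantically closer image of the triplet.
    sim_neg: similarity between the anchor sentence and the sentence generated
             for the semantically more distant image.
    """
    return max(0.0, margin - sim_pos + sim_neg)


if __name__ == "__main__":
    caption = "a dog chases a ball".split()
    print(object_inclusion_reward(caption, ["dog", "ball", "frisbee"]))
    print(triplet_hinge_loss(sim_pos=0.5, sim_neg=0.6))
```

In this sketch the reward grows as more recognized objects are mentioned, and the loss penalizes triplets whose generated sentences violate the similarity ordering of the images. How the paper measures sentence similarity or weights the rewards is not stated in the abstract.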

Citation

Pages: 904-916

Page count: 13

References (53 in total):

[21] Towards Unsupervised Image Captioning with Shared Multimodal Embeddings [J].

Laina, Iro ;

Rupprecht, Christian ;

Navab, Nassir .

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :7413-7423

[22] Lample Guillaume, 2018, C TRACK P

[23] Lee D.-H., 2013, Workshop on Challenges in Representation Learning, ICML, V3, P881

[24] Know More Say Less: Image Captioning Based on Scene Graphs [J].

Li, Xiangyang ;

Jiang, Shuqiang .

IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (08) :2117-2130

[25] Adding Chinese Captions to Images [J].

Li, Xirong ;

Lan, Weiyu ;

Dong, Jianfeng ;

Liu, Hailong .

ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, :271-275

[26] Microsoft COCO: Common Objects in Context [J].

Lin, Tsung-Yi ;

Maire, Michael ;

Belongie, Serge ;

Hays, James ;

Perona, Pietro ;

Ramanan, Deva ;

Dollar, Piotr ;

Zitnick, C. Lawrence .

COMPUTER VISION - ECCV 2014, PT V, 2014, 8693 :740-755

[27] Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning [J].

Lu, Jiasen ;

Xiong, Caiming ;

Parikh, Devi ;

Socher, Richard .

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :3242-3250

[28] X-Linear Attention Networks for Image Captioning [J].

Pan, Yingwei ;

Yao, Ting ;

Li, Yehao ;

Mei, Tao .

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :10968-10977

[29] CRA-Net: Composed Relation Attention Network for Visual Question Answering [J].

Peng, Liang ;

Yang, Yang ;

Wang, Zheng ;

Wu, Xiao ;

Huang, Zi .

PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, :1202-1210

[30] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models [J].

Plummer, Bryan A. ;

Wang, Liwei ;

Cervantes, Chris M. ;

Caicedo, Juan C. ;

Hockenmaier, Julia ;

Lazebnik, Svetlana .

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2641-2649
