Unpaired Image Captioning With Semantic-Constrained Self-Learning

Cited by: 30
Authors
Ben, Huixia [1 ]
Pan, Yingwei [2 ]
Li, Yehao [2 ]
Yao, Ting [2 ]
Hong, Richang [1 ]
Wang, Meng [1 ]
Mei, Tao [2 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp & Informat, Hefei 230009, Peoples R China
[2] JD AI Res, CV Lab, Beijing 100105, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Image recognition; Training; Visualization; Decoding; Task analysis; Dogs; Encoder-decoder networks; image captioning; self-supervised learning;
DOI
10.1109/TMM.2021.3060948
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Image captioning has been an emerging and fast-developing research topic. Nevertheless, most existing works rely heavily on large amounts of image-sentence pairs, which hinders the practical application of captioning in the wild. In this paper, we present a novel Semantic-Constrained Self-learning (SCS) framework that explores an iterative self-learning strategy to learn an image captioner with only unpaired image and text data. Technically, SCS consists of two stages, i.e., pseudo pair generation and captioner re-training, which iteratively produce "pseudo" image-sentence pairs via a pre-trained captioner and re-train the captioner on those pseudo pairs, respectively. In particular, both stages are guided by the objects recognized in the image, which act as a semantic constraint to strengthen the semantic alignment between the input image and the output sentence. For pseudo pair generation, we leverage a semantic-constrained beam search that regularizes the decoding process by forcing the inclusion of recognized objects, and the exclusion of irrelevant ones, in the output sentence. For captioner re-training, a self-supervised triplet loss is utilized to preserve the relative semantic similarity ordering among generated sentences with regard to the input image triplets. Moreover, an object inclusion reward and an adversarial reward are adopted during self-critical training to encourage the inclusion of the predicted objects in the output sentence and to pursue the generation of more realistic sentences, respectively. Experiments conducted on both dependent and independent unpaired data validate the superiority of SCS. More remarkably, we obtain the best published CIDEr score to date of 74.7% on the COCO Karpathy test split for unpaired image captioning.
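As an illustrative sketch only (not the authors' released code), the two simplest ingredients named in the abstract can be written roughly as follows: a hinge-based triplet loss over sentence/image feature vectors, and an object-inclusion reward that scores a caption by the fraction of recognized objects it mentions. Function names, the cosine-similarity choice, and the margin value are assumptions for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge-based triplet loss: the sentence generated for an image similar
    # to the anchor (positive) should be closer to the anchor's features
    # than the sentence from a dissimilar image (negative), by `margin`.
    pos = cosine_sim(anchor, positive)
    neg = cosine_sim(anchor, negative)
    return max(0.0, margin + neg - pos)

def object_inclusion_reward(caption_tokens, recognized_objects):
    # Fraction of recognized objects that actually appear in the caption,
    # used as a reward signal during self-critical training.
    if not recognized_objects:
        return 0.0
    hits = sum(obj in caption_tokens for obj in recognized_objects)
    return hits / len(recognized_objects)
```

For example, a caption mentioning two of three recognized objects would receive a reward of 2/3, and a triplet whose positive sentence is already much closer to the anchor than the negative incurs zero loss.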
Pages: 904-916
Number of pages: 13
Related Papers
50 records in total
  • [1] Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning
    Song, Peipei
    Guo, Dan
    Zhou, Jinxing
    Xu, Mingliang
    Wang, Meng
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (07) : 4388 - 4399
  • [2] Semantic-Guided Selective Representation for Image Captioning
    Li, Yinan
    Ma, Yiwei
    Zhou, Yiyi
    Yu, Xiao
    IEEE ACCESS, 2023, 11 : 14500 - 14510
  • [3] Self-Learning for Few-Shot Remote Sensing Image Captioning
    Zhou, Haonan
    Du, Xiaoping
    Xia, Lurui
    Li, Sen
    REMOTE SENSING, 2022, 14 (18)
  • [4] TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning
    Wang, Lanxiao
    Qiu, Heqian
    Qiu, Benliu
    Meng, Fanman
    Wu, Qingbo
    Li, Hongliang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (05) : 3563 - 3575
  • [5] Unpaired Image Captioning by Image-Level Weakly-Supervised Visual Concept Recognition
    Zhu, Peipei
    Wang, Xiao
    Luo, Yong
    Sun, Zhenglong
    Zheng, Wei-Shi
    Wang, Yaowei
    Chen, Changwen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 6702 - 6716
  • [6] Cascade Semantic Prompt Alignment Network for Image Captioning
    Li, Jingyu
    Zhang, Lei
    Zhang, Kun
    Hu, Bo
    Xie, Hongtao
    Mao, Zhendong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5266 - 5281
  • [7] High-Order Interaction Learning for Image Captioning
    Wang, Yanhui
    Xu, Ning
    Liu, An-An
    Li, Wenhui
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (07) : 4417 - 4430
  • [8] Semantic Representations With Attention Networks for Boosting Image Captioning
    Hafeth, Deema Abdal
    Kollias, Stefanos
    Ghafoor, Mubeen
    IEEE ACCESS, 2023, 11 : 40230 - 40239
  • [9] Adaptive Semantic-Enhanced Transformer for Image Captioning
    Zhang, Jing
    Fang, Zhongjun
    Sun, Han
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 1785 - 1796
  • [10] Discriminative Style Learning for Cross-Domain Image Captioning
    Yuan, Jin
    Zhu, Shuai
    Huang, Shuyin
    Zhang, Hanwang
    Xiao, Yaoqiang
    Li, Zhiyong
    Wang, Meng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1723 - 1736