TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning

Cited: 1
Authors
Wang, Lanxiao [1 ]
Qiu, Heqian [1 ]
Qiu, Benliu [1 ]
Meng, Fanman [1 ]
Wu, Qingbo [1 ]
Li, Hongliang [1 ]
Affiliations
[1] Univ Elect Sci & Technol China UESTC, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Decoding; Dogs; Task analysis; Feature extraction; Annotations; Visualization; Stylized image captioning; trident data; image-fact-style; multi-style captioning; pseudo labels filter; NETWORK;
DOI
10.1109/TCSVT.2023.3315133
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Stylized image captioning (SIC) aims to generate captions with a target style for images. The biggest challenge is that collecting and annotating stylized data is difficult and time-consuming. Most existing methods independently learn from massive factual captions or an additional stylized book corpus to assist stylized caption generation, ignoring the core relationships within existing image-fact-style trident data. In this paper, we propose TridentCap, a novel image-fact-style trident semantic framework for stylized image captioning, which comprises an image-fact semantic fusion encoder (SFE) and a trident stylization decoder (TSD). Unlike existing methods, we directly mine the core relationship in image-fact-style trident data and use factual semantics together with the image to build a cross-modal semantic feature space, achieving coherence between image and text. Specifically, the SFE learns image-related prior language knowledge from factual text and leverages fine-grained region-level semantic correlations between the image and the factual text to achieve cross-modal semantic alignment and integration. The TSD decouples the dual-source fused semantic features according to the target style to generate stylized captions. In addition, we design a pseudo labels filter (PLF) that expands the pool of image-fact-style trident data by building pseudo stylized annotations for all image-fact pairs in traditional caption datasets, further strengthening stylized caption learning. The PLF is a generic algorithm for alleviating data scarcity and can be plugged into any existing stylized captioning model. Extensive experiments on the SentiCap and FlickrStyle datasets show consistent improvements on almost all metrics. Our code will be released at: https://github.com/WangLanxiao/TridentCap_Code.
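To make the abstract's data flow concrete, the sketch below mimics the three stages it names (SFE fusion, TSD stylization, PLF filtering) on toy vectors. All function names, the averaging fusion, the style-bias decoding, and the score-threshold filter are illustrative assumptions for exposition only; the paper's actual modules are learned neural networks, not these hand-written rules.

```python
# Toy illustration of the TridentCap pipeline stages described in the abstract.
# NOTE: the fusion, stylization, and filtering rules here are placeholders,
# not the paper's learned components.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TridentSample:
    image_feat: List[float]   # region-level image features (flattened toy vector)
    fact_caption: str         # paired factual caption
    style: str                # target style tag, e.g. "positive" or "negative"


def sfe_fuse(image_feat: List[float], fact_feat: List[float]) -> List[float]:
    """Image-fact Semantic Fusion Encoder (toy): align the two feature
    vectors position-wise and average them into one fused representation."""
    return [(a + b) / 2 for a, b in zip(image_feat, fact_feat)]


def tsd_decode(fused: List[float], style: str) -> List[float]:
    """Trident Stylization Decoder (toy): decouple the fused features
    under the target style by applying a style-dependent shift."""
    bias = {"positive": 0.1, "negative": -0.1}.get(style, 0.0)
    return [f + bias for f in fused]


def plf_filter(pseudo_pairs: List[Tuple[str, float]],
               threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Pseudo Labels Filter (toy): keep only pseudo stylized captions
    whose confidence score clears the threshold, discarding noisy ones."""
    return [(cap, score) for cap, score in pseudo_pairs if score >= threshold]
```

For example, `plf_filter([("a happy dog plays", 0.9), ("noisy text", 0.2)])` keeps only the high-confidence pseudo caption, mirroring how the PLF expands image-fact data into usable trident data.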
Pages: 3563 - 3575
Page count: 13