TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning

Cited by: 1
Authors
Wang, Lanxiao [1]
Qiu, Heqian [1]
Qiu, Benliu [1]
Meng, Fanman [1]
Wu, Qingbo [1]
Li, Hongliang [1]
Affiliations
[1] University of Electronic Science and Technology of China (UESTC), School of Information and Communication Engineering, Chengdu 611731, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Decoding; Dogs; Task analysis; Feature extraction; Annotations; Visualization; Stylized image captioning; trident data; image-fact-style; multi-style captioning; pseudo labels filter; NETWORK;
DOI
10.1109/TCSVT.2023.3315133
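Chinese Library Classification number
TM [Electrical engineering]; TN [Electronic and communication technology];
Discipline classification code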
0808 ; 0809 ;
Abstract
Stylized image captioning (SIC) aims to generate captions in a target style for images. The biggest challenge is that collecting and annotating stylized data is difficult and time-consuming. Most existing methods learn from massive factual captions or an additional stylized book corpus independently to assist in generating stylized captions, which ignores the core relationships within existing image-fact-style trident data. In this paper, we propose a novel image-fact-style trident semantic framework, TridentCap, for stylized image captioning, which includes an image-fact semantic fusion encoder (SFE) and a trident stylization decoder (TSD). Unlike existing methods, we directly mine the core relationship in image-fact-style trident data and use factual semantics and the image to build a cross-modal semantic feature space, achieving coherence between image and text. Specifically, SFE learns image-related prior language knowledge from factual text and leverages fine-grained region-level semantic correlations between the image and the factual text to achieve cross-modal semantic alignment and integration. TSD is designed to decouple the dual-source fused semantic feature based on the target style to achieve stylized caption generation. In addition, we design a pseudo labels filter (PLF) to obtain and expand massive image-fact-style trident data by building pseudo stylized annotations for all image-fact data in traditional caption datasets, which can further strengthen stylized caption learning. It is a generic algorithm for addressing insufficient data and can be applied to any existing stylized caption model. We conduct extensive experiments on the SentiCap and FlickrStyle datasets and achieve consistent improvements on almost all metrics. Our code will be released at: https://github.com/WangLanxiao/TridentCap_Code.
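The abstract does not say how the pseudo labels filter decides which pseudo stylized annotations to keep, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a hypothetical `stylize` callable that proposes a stylized caption for a factual one, and it keeps a candidate only if it preserves enough factual tokens and contains at least one word from a hypothetical style lexicon. The names `Trident`, `token_overlap`, `style_hits`, `filter_pseudo_labels`, and both thresholds are illustrative and do not come from the paper or its code release.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Trident:
    """One image-fact-style sample (hypothetical container for this sketch)."""
    image_id: str
    fact: str
    styled: str


def token_overlap(fact: str, styled: str) -> float:
    """Fraction of distinct factual tokens that also appear in the stylized caption."""
    fact_tokens = set(fact.lower().split())
    styled_tokens = set(styled.lower().split())
    if not fact_tokens:
        return 0.0
    return len(fact_tokens & styled_tokens) / len(fact_tokens)


def style_hits(styled: str, style_words: FrozenSet[str]) -> int:
    """Number of tokens in the stylized caption found in the style lexicon."""
    return sum(1 for tok in styled.lower().split() if tok in style_words)


def filter_pseudo_labels(
    pairs: Iterable[Tuple[str, str]],
    stylize: Callable[[str], str],
    style_words: FrozenSet[str],
    min_overlap: float = 0.6,   # assumed threshold, not from the paper
    min_style_hits: int = 1,    # assumed threshold, not from the paper
) -> List[Trident]:
    """Build pseudo stylized captions for image-fact pairs and keep the plausible ones."""
    kept: List[Trident] = []
    for image_id, fact in pairs:
        styled = stylize(fact)
        keeps_facts = token_overlap(fact, styled) >= min_overlap
        has_style = style_hits(styled, style_words) >= min_style_hits
        if keeps_facts and has_style:
            kept.append(Trident(image_id, fact, styled))
    return kept
```

In practice the toy lexicon and token-overlap checks would be replaced by whatever scoring the real filter uses; the sketch only shows the shape of the data flow, from image-fact pairs to filtered image-fact-style triples.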
Pages: 3563-3575
Number of pages: 13