Cascade Semantic Prompt Alignment Network for Image Captioning

Cited by: 3
Authors
Li, Jingyu [1 ]
Zhang, Lei [2 ]
Zhang, Kun [2 ]
Hu, Bo [2 ]
Xie, Hongtao [2 ]
Mao, Zhendong [1 ,3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230022, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Visualization; Feature extraction; Detectors; Integrated circuit modeling; Transformers; Task analysis; Image captioning; textual-visual alignment; RegionCLIP; prompt; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3343520
CLC classification
TM [Electrical technology]; TN [Electronics and communication technology];
Discipline codes
0808; 0809;
Abstract
Image captioning (IC) takes an image as input and generates open-form descriptions in the domain of natural language. IC requires detecting objects, modeling the relations between them, assessing the semantics of the scene, and representing the extracted knowledge in a language space. Previous detector-based models suffer from limited semantic perception capability due to predefined object detection classes and the semantic inconsistency between visual region features and the detector's numeric labels. Inspired by the fact that text prompts in pre-trained multi-modal models carry specific linguistic knowledge rather than discrete labels, and excel at open-form semantic understanding of visual inputs and their representation in the domain of natural language, we aim to distill and leverage the transferable language knowledge of the pre-trained RegionCLIP model to remedy the detector's limitations and generate rich image captions. In this paper, we propose a novel Cascade Semantic Prompt Alignment Network (CSA-Net) that produces an aligned fine-grained regional semantic-visual space in which rich and consistent textual semantic details are automatically incorporated into region features. Specifically, we first align the object semantic prompts with the region features to produce semantically grounded object features. Then, we employ these object features together with the relation semantic prompts to predict the relations between objects. Finally, the enhanced object and relation features are fed into the language decoder to generate rich descriptions. Extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance, with CIDEr scores of 145.2% (single model) and 147.0% (ensemble of 4 models) on the 'Karpathy' split, and 141.6% (c5) and 144.1% (c40) on the official online test server. Notably, CSA-Net excels at generating captions of higher quality and diversity, achieving a RefCLIP-S score of 83.2. Moreover, when we extend the testbed to another challenging captioning benchmark, the nocaps dataset, CSA-Net demonstrates superior zero-shot capability. Source code is released at https://github.com/CrossmodalGroup/CSA-Net.
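To make the cascade described in the abstract concrete, the following is a minimal PyTorch sketch of the two alignment stages: object prompts are aligned with region features via cross-attention, and the grounded object features are paired and aligned with relation prompts before being handed to a caption decoder. All names here (PromptRegionAlignment, CascadeSemanticAlignment), the dimensions, the use of multi-head cross-attention, and the pairwise relation construction are illustrative assumptions, not the authors' released implementation; see the linked repository for that.

# Hypothetical sketch of the cascade semantic prompt alignment, under the
# assumptions stated above. Random tensors stand in for RegionCLIP outputs.
import torch
import torch.nn as nn

class PromptRegionAlignment(nn.Module):
    """Cross-attention that grounds visual features in text-prompt embeddings."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) visual features; prompts: (B, P, D) prompt embeddings.
        attended, _ = self.attn(query=feats, key=prompts, value=prompts)
        return self.norm(feats + attended)  # semantically grounded features

class CascadeSemanticAlignment(nn.Module):
    """Stage 1: align object prompts with region features. Stage 2: align
    relation prompts with pairwise object features. Decoder omitted."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.object_align = PromptRegionAlignment(dim)
        self.relation_align = PromptRegionAlignment(dim)
        self.pair_proj = nn.Linear(2 * dim, dim)  # fuse object pairs into relation queries

    def forward(self, regions, object_prompts, relation_prompts):
        objects = self.object_align(regions, object_prompts)  # (B, R, D)
        # Build features for every ordered object pair (i, j).
        B, R, D = objects.shape
        pairs = torch.cat(
            [objects.unsqueeze(2).expand(B, R, R, D),
             objects.unsqueeze(1).expand(B, R, R, D)], dim=-1
        ).view(B, R * R, 2 * D)
        relations = self.relation_align(self.pair_proj(pairs), relation_prompts)
        return objects, relations  # inputs to the language decoder

model = CascadeSemanticAlignment(dim=512)
regions = torch.randn(2, 10, 512)      # 10 region features per image
obj_prompts = torch.randn(2, 80, 512)  # e.g. embedded "a photo of a {class}" prompts
rel_prompts = torch.randn(2, 20, 512)  # embedded relation prompts
objects, relations = model(regions, obj_prompts, rel_prompts)
print(objects.shape, relations.shape)  # (2, 10, 512) and (2, 100, 512)

In the actual model, the object and relation prompts would be embedded with RegionCLIP's text encoder rather than supplied as random tensors, and the decoder would attend over both the object and relation streams.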
Pages: 5266-5281
Page count: 16