Cascade Semantic Prompt Alignment Network for Image Captioning

Cited by: 3

Authors:

Li, Jingyu [1]

Zhang, Lei [2]

Zhang, Kun [2]

Hu, Bo [2]

Xie, Hongtao [2]

Mao, Zhendong [1,3]

Affiliations:

[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230022, Peoples R China

[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China

[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / No. 07

Funding:

National Natural Science Foundation of China;

Keywords:

Semantics; Visualization; Feature extraction; Detectors; Integrated circuit modeling; Transformers; Task analysis; Image captioning; textual-visual alignment; RegionCLIP; prompt; TRANSFORMER;

DOI:

10.1109/TCSVT.2023.3343520

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Image captioning (IC) takes an image as input and generates open-form descriptions in natural language. IC requires detecting objects, modeling the relations between them, assessing the semantics of the scene, and representing the extracted knowledge in a language space. Previous detector-based models suffer from limited semantic perception capability because of predefined object detection classes and semantic inconsistency between visual region features and the numeric labels of the detector. Text prompts in pre-trained multi-modal models contain specific linguistic knowledge rather than discrete labels, and they excel at open-form semantic understanding of visual inputs and at representing them in natural language. Inspired by this, we aim to distill and leverage the transferable language knowledge from the pre-trained RegionCLIP model to remedy the detector's limitations and generate rich image captions. In this paper, we propose a novel Cascade Semantic Prompt Alignment Network (CSA-Net) to produce an aligned fine-grained regional semantic-visual space in which rich and consistent textual semantic details are automatically incorporated into region features. Specifically, we first align the object semantic prompt and region features to produce semantically grounded object features. Then, we employ these object features and the relation semantic prompt to predict the relations between objects. Finally, these enhanced object and relation features are fed into the language decoder to generate rich descriptions. Extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance, with CIDEr scores of 145.2% (single model) and 147.0% (ensemble of 4 models) on the 'Karpathy' split, and 141.6% (c5) and 144.1% (c40) on the official online test server. Notably, CSA-Net generates captions with higher quality and diversity, achieving a RefCLIP-S score of 83.2. Moreover, we extend the evaluation to other challenging captioning benchmarks, i.e., the nocaps dataset, where CSA-Net demonstrates superior zero-shot capability. Source code is released at https://github.com/CrossmodalGroup/CSA-Net.

Pages: 5266-5281

Page count: 16