Dual-adaptive interactive transformer with textual and visual context for image captioning

Cited by: 3
Authors
Chen, Lizhi [1 ]
Li, Kesen [2 ]
Affiliations
[1] Soochow Univ, Sch Software, Suzhou 215000, Peoples R China
[2] Zhejiang A&F Univ, Coll Engn Technol, Jiyang Coll, Zhuji 311899, Peoples R China
Keywords
Image captioning; Transformer; Textual and visual; Encoder-decoder; Adaptive Interactive;
DOI
10.1016/j.eswa.2023.122955
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The multimodal Transformer, which integrates visual and textual contextual information, has recently shown success in image captioning tasks. However, text and vision remain naturally complementary yet partially redundant, and effectively integrating the information from both modalities is crucial for comprehending the content of an image. In this paper, we propose the Dual-Adaptive Interactive Transformer (DAIT), which incorporates similar textual and visual contextual information into both the encoding and decoding stages. Specifically, during encoding, we propose the Adaptive Interactive Encoder (AIE), which expands the feature vectors of both modalities through newly introduced operations. Furthermore, we introduce normalization gate factors to mitigate noise caused by the interaction between the two modalities. During decoding, we propose the Adaptive Interactive Decoder (AID), which adaptively adjusts the multimodal features at each time step through similarity-weighted textual and visual branches. To validate our model, we conducted extensive experiments on the MS COCO image captioning dataset and achieved outstanding performance compared to many state-of-the-art methods.
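The gated cross-modal interaction the abstract describes can be illustrated with a minimal NumPy sketch. This is a hedged illustration only: the function name, the sigmoid form of the normalization gate, and the learned matrix `w_gate` are assumptions for exposition, not the paper's exact AIE formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_modal_fusion(visual, textual, w_gate):
    """Inject textual context into visual features through a bounded gate.

    visual:  (n_regions, d) visual feature vectors
    textual: (n_tokens, d)  textual feature vectors
    w_gate:  (2*d, d)       assumed learned gate projection (illustrative)
    """
    d = visual.shape[-1]
    # Cross-modal attention: each visual feature attends over textual features.
    attn = softmax(visual @ textual.T / np.sqrt(d))
    context = attn @ textual
    # Sigmoid gate in (0, 1) scales the injected context, so noisy
    # cross-modal interactions are attenuated rather than added wholesale.
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([visual, context], axis=-1) @ w_gate)))
    return visual + gate * context
```

The gate plays the role of the "normalization gate factors" mentioned above: because it is bounded in (0, 1), the contribution of the other modality can never overwhelm the original features.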
Pages: 10
Related Papers (50 total)
  • [1] Geometrically-Aware Dual Transformer Encoding Visual and Textual Features for Image Captioning
    Chang, Yu-Ling
    Ma, Hao-Shang
    Li, Shiou-Chi
    Huang, Jen-Wei
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT V, PAKDD 2024, 2024, 14649 : 15 - 27
  • [2] Dual-visual collaborative enhanced transformer for image captioning
    Mou, Zhenping
    Song, Tianqi
    Luo, Hong
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [3] Image Captioning Based on Visual Relevance and Context Dual Attention
    Liu M.-F.
    Shi Q.
    Nie L.-Q.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (09)
  • [4] Relational Attention with Textual Enhanced Transformer for Image Captioning
    Song, Lifei
    Shi, Yiwen
    Xiao, Xinyu
    Zhang, Chunxia
    Xiang, Shiming
    PATTERN RECOGNITION AND COMPUTER VISION, PT III, 2021, 13021 : 151 - 163
  • [5] GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
    Van-Quang Nguyen
    Suganuma, Masanori
    Okatani, Takayuki
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 167 - 184
  • [6] Context-aware transformer for image captioning
    Yang, Xin
    Wang, Ying
    Chen, Haishun
    Li, Jie
    Huang, Tingting
    NEUROCOMPUTING, 2023, 549
  • [7] Context-assisted Transformer for Image Captioning
    Lian Z.
    Wang R.
    Li H.-C.
    Yao H.
    Hu X.-H.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (09): 1889 - 1903
  • [8] A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning
    Shi, Yinbin
    Xia, Ji
    Zhou, MengChu
    Cao, Zhengcai
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2024, 73 : 1 - 13
  • [9] Dual Global Enhanced Transformer for image captioning
    Xian, Tiantao
    Li, Zhixin
    Zhang, Canlong
    Ma, Huifang
    NEURAL NETWORKS, 2022, 148 : 129 - 141
  • [10] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    BIG DATA, 2022, 10 (06) : 515 - 527