Dual-adaptive interactive transformer with textual and visual context for image captioning

Cited by: 3
Authors
Chen, Lizhi [1 ]
Li, Kesen [2 ]
Affiliations
[1] Soochow Univ, Sch Software, Suzhou 215000, Peoples R China
[2] Zhejiang A&F Univ, Coll Engn Technol, Jiyang Coll, Zhuji 311899, Peoples R China
Keywords
Image captioning; Transformer; Textual and visual; Encoder-decoder; Adaptive Interactive;
DOI
10.1016/j.eswa.2023.122955
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The multimodal Transformer, which integrates visual and textual contextual information, has recently shown success in image captioning tasks. However, text and vision remain naturally complementary yet partially redundant, and effectively integrating the information from both modalities is crucial for comprehending the content of an image. In this paper, we propose the Dual-Adaptive Interactive Transformer (DAIT), which incorporates similar textual and visual contextual information into both the encoding and decoding stages. Specifically, during encoding, we propose the Adaptive Interactive Encoder (AIE), which expands the feature vectors of both modalities through newly introduced operations. Furthermore, we introduce normalization gate factors to mitigate noise caused by the interaction between the two modalities. During decoding, we propose the Adaptive Interactive Decoder (AID), which adaptively adjusts the multimodal features at each time step through similarity-weighted textual and visual branches. To validate our model, we conducted extensive experiments on the MS COCO image captioning dataset and achieved outstanding performance compared to many state-of-the-art methods.
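The gated cross-modal interaction the abstract describes can be illustrated with a minimal NumPy sketch. This is a hedged illustration only: the function name, the sigmoid form of the normalization gate, and the learned matrix `w_gate` are assumptions for exposition, not the paper's exact AIE formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_modal_fusion(visual, textual, w_gate):
    """Inject textual context into visual features through a bounded gate.

    visual:  (n_regions, d) visual feature vectors
    textual: (n_tokens, d)  textual feature vectors
    w_gate:  (2*d, d)       assumed learned gate projection (illustrative)
    """
    d = visual.shape[-1]
    # Cross-modal attention: each visual feature attends over textual features.
    attn = softmax(visual @ textual.T / np.sqrt(d))
    context = attn @ textual
    # Sigmoid gate in (0, 1) scales the injected context, so noisy
    # cross-modal interactions are attenuated rather than added wholesale.
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([visual, context], axis=-1) @ w_gate)))
    return visual + gate * context
```

The gate plays the role of the "normalization gate factors" mentioned above: because it is bounded in (0, 1), the contribution of the other modality can never overwhelm the original features.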
Pages: 10
Related Papers (50 total)
  • [1] Geometrically-Aware Dual Transformer Encoding Visual and Textual Features for Image Captioning
    Chang, Yu-Ling
    Ma, Hao-Shang
    Li, Shiou-Chi
    Huang, Jen-Wei
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT V, PAKDD 2024, 2024, 14649 : 15 - 27
  • [2] Dual-visual collaborative enhanced transformer for image captioning
    Mou, Zhenping
    Song, Tianqi
    Luo, Hong
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [3] Image Captioning Based on Visual Relevance and Context Dual Attention
    Liu M.-F.
    Shi Q.
    Nie L.-Q.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (09)
  • [4] Relational Attention with Textual Enhanced Transformer for Image Captioning
    Song, Lifei
    Shi, Yiwen
    Xiao, Xinyu
    Zhang, Chunxia
    Xiang, Shiming
    PATTERN RECOGNITION AND COMPUTER VISION, PT III, 2021, 13021 : 151 - 163
  • [5] GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
    Van-Quang Nguyen
    Suganuma, Masanori
    Okatani, Takayuki
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 167 - 184
  • [6] Context-aware transformer for image captioning
    Yang, Xin
    Wang, Ying
    Chen, Haishun
    Li, Jie
    Huang, Tingting
    NEUROCOMPUTING, 2023, 549
  • [7] Context-assisted Transformer for Image Captioning
    Lian Z.
    Wang R.
    Li H.-C.
    Yao H.
    Hu X.-H.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (09): 1889 - 1903
  • [8] A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning
    Shi, Yinbin
    Xia, Ji
    Zhou, MengChu
    Cao, Zhengcai
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2024, 73 : 1 - 13
  • [9] Dual Global Enhanced Transformer for image captioning
    Xian, Tiantao
    Li, Zhixin
    Zhang, Canlong
    Ma, Huifang
    NEURAL NETWORKS, 2022, 148 : 129 - 141
  • [10] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    BIG DATA, 2022, 10 (06) : 515 - 527