Adaptive Path Selection for Dynamic Image Captioning

Cited by: 44
Authors
Xian, Tiantao [1]
Li, Zhixin [1]
Tang, Zhenjun [1]
Ma, Huifang [2]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Feature extraction; Transformers; Semantics; Computational modeling; Adaptation models; Computer architecture; Image captioning; transformer; dynamic routing mechanism
DOI
10.1109/TCSVT.2022.3155795
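Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract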
Image captioning is a challenging task in which, given an image, a machine automatically generates natural language that matches the image's semantic content; it has attracted much attention in recent years. However, most existing models are designed manually, and their performance depends heavily on the expertise of the designer. In addition, the computational flow of the model is predefined, so hard and easy samples share the same coding path and easily interfere with each other, which confuses the learning of the model. In this paper, we propose a Dynamic Transformer that changes the encoding procedure from sequential to adaptive, i.e., to data-dependent computing paths. Specifically, we design three different types of visual feature extraction blocks and deploy them in parallel at each layer to construct a multi-layer routing space in a fully connected manner. Each block contains a calculation unit that performs the corresponding operations and a routing gate that learns to adaptively select the direction in which to pass the signal based on the input image. Thus, our model can achieve a robust visual representation by exploring potential visual feature extraction paths. We evaluate our method quantitatively and qualitatively on the benchmark MSCOCO image captioning dataset and perform extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method is significantly superior to previous state-of-the-art methods.
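The routing idea in the abstract (three parallel blocks per layer, each with a calculation unit and an input-dependent routing gate, layers fully connected) can be illustrated with a toy sketch. This is not the authors' implementation: the three operations (`identity`, `rectify`, `center`), the softmax-based soft gate, the random gate weights, and all names such as `Block`, `build_layers`, and `route` are assumptions chosen only to make the data flow concrete.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Three placeholder "calculation units" (hypothetical; the abstract does
# not say what the three block types compute).
def identity(x):
    return list(x)


def rectify(x):
    return [max(0.0, v) for v in x]


def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]


class Block:
    """A calculation unit plus a routing gate (assumed to be a soft gate)."""

    def __init__(self, op, dim, n_next, rng):
        self.op = op
        # One weight row per block in the next layer; random, not learned.
        self.gate_w = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                       for _ in range(n_next)]

    def forward(self, x):
        y = self.op(x)
        # The gate depends on the block's own output, hence on the input.
        logits = [sum(w * v for w, v in zip(row, y)) for row in self.gate_w]
        return y, softmax(logits)


def build_layers(depth, dim, seed=0):
    """Stack `depth` layers, each holding the three block types in parallel."""
    rng = random.Random(seed)
    ops = [identity, rectify, center]
    return [[Block(op, dim, len(ops), rng) for op in ops]
            for _ in range(depth)]


def route(x, layers):
    """Pass feature vector `x` through the fully connected routing space."""
    n, dim = len(layers[0]), len(x)
    inputs = [list(x) for _ in range(n)]  # every first-layer block sees x
    all_gates = []
    for layer in layers:
        outs = [blk.forward(inp) for blk, inp in zip(layer, inputs)]
        all_gates.append([g for _, g in outs])
        # Input of next-layer block j = gate-weighted sum of all outputs.
        inputs = [[sum(g[j] * y[k] for y, g in outs) for k in range(dim)]
                  for j in range(n)]
    # Average the routed signals leaving the final layer.
    out = [sum(v[k] for v in inputs) / n for k in range(dim)]
    return out, all_gates


layers = build_layers(depth=2, dim=4, seed=0)
features, gates = route([0.5, -1.0, 2.0, 0.1], layers)
print(features)
print(gates[0])  # routing weights chosen by the three first-layer blocks
```

Because each gate is computed from its block's output, two different input vectors generally yield different routing weights, which is the sense in which the path is data-dependent in this sketch. A trained model would learn the gate parameters and might use hard or sparse path selection instead of the soft weighting assumed here.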
Pages: 5762-5775
Page count: 14
Related papers
50 in total
  • [21] EdgeScan for IoT Contextual Understanding With Edge Computing and Image Captioning
    Hafeth, Deema Abdal
    Al-Khafajiy, Mohammed
    Kollias, Stefanos
    IEEE INTERNET OF THINGS JOURNAL, 2025, 12 (06) : 6519 - 6535
  • [22] Context-Adaptive-Based Image Captioning by Bi-CARU
    Im, Sio-Kei
    Chan, Ka-Hou
    IEEE ACCESS, 2023, 11 : 84934 - 84943
  • [23] Image Captioning With Positional and Geometrical Semantics
    Ul Haque, Anwar
    Ghani, Sayeed
    Saeed, Muhammad
    IEEE ACCESS, 2021, 9 : 160917 - 160925
  • [24] Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning
    Meng, Lingwu
    Wang, Jing
    Yang, Yang
    Xiao, Liang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61 : 1 - 13
  • [25] An Image Captioning Model Based on Bidirectional Depth Residuals and its Application
    Zhou, Ziwei
    Xu, Liang
    Wang, Chaoyang
    Xie, Wei
    Wang, Shuo
    Ge, Shaoqiang
    Zhang, Ye
    IEEE ACCESS, 2021, 9 : 25360 - 25370
  • [26] Multimodal Transformer With Multi-View Visual Representation for Image Captioning
    Yu, Jun
    Li, Jing
    Yu, Zhou
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (12) : 4467 - 4480
  • [27] De-Confounding Feature Fusion Transformer Network for Image Captioning in Assistive Navigation Applications for the Visually Impaired
    Cao, Zhengcai
    Xia, Ji
    Zhou, MengChu
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2025, 74
  • [28] Switching Text-Based Image Encoders for Captioning Images With Text
    Ueda, Arisa
    Yang, Wei
    Sugiura, Komei
    IEEE ACCESS, 2023, 11 : 55706 - 55715
  • [29] Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning
    Song, Zijie
    Hu, Zhenzhen
    Zhou, Yuanen
    Zhao, Ye
    Hong, Richang
    Wang, Meng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9008 - 9020
  • [30] Hierarchical LSTMs with Adaptive Attention for Visual Captioning
    Gao, Lianli
    Li, Xiangpeng
    Song, Jingkuan
    Shen, Heng Tao
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (05) : 1112 - 1131