Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited: 0
Authors
Su, Jindian [1 ]
Mou, Yueqi [1 ]
Xie, Yunhao [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024, Vol. 14875
Keywords
Image Captioning; Two-Pass Decoding; Transformer;
DOI
10.1007/978-981-97-5663-6_15
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models. This practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
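The two-pass idea in the abstract can be sketched in miniature: a Draft Decoder first produces a caption conditioned only on image features, then a Deliberation Decoder re-decodes while attending to both the image features and the draft caption's embeddings (a toy stand-in for the CMF module). All tensors, names, and the greedy decoding loop below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 8, 4

# Stand-ins for frozen pre-trained components: visual encoder output,
# an embedding table shared by both decoders, and an output projection.
img_feats = rng.normal(size=(3, D))      # 3 image-region features
embed = rng.normal(size=(VOCAB, D))
W_out = rng.normal(size=(D, VOCAB))

def attend(query, keys):
    """Single-head dot-product attention pooling over `keys`."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def decode(context_fn, steps=5):
    """Greedy decoding: pool a context vector, emit the argmax token."""
    tokens, state = [], np.zeros(D)
    for _ in range(steps):
        ctx = context_fn(state)
        tok = int(np.argmax(ctx @ W_out))
        tokens.append(tok)
        state = 0.5 * state + 0.5 * embed[tok]  # toy recurrent state
    return tokens

# Pass 1: the Draft Decoder attends only to image features.
draft = decode(lambda s: attend(s, img_feats))

# Pass 2: the Deliberation Decoder attends to image features fused
# with the draft caption's embeddings (toy cross-modality fusion).
fused_keys = np.concatenate([img_feats, embed[draft]], axis=0)
refined = decode(lambda s: attend(s, fused_keys))

print("draft:  ", draft)
print("refined:", refined)
```

In the paper's framework the Draft Decoder is taken from a pre-trained single-pass model and kept frozen, so only the deliberation-side parameters (here, conceptually, the fusion over `fused_keys`) would be updated during training.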
Pages: 171-183
Page count: 13
Related Papers
50 records
  • [31] A Transformer-Based Bridge Structural Response Prediction Framework
    Li, Ziqi
    Li, Dongsheng
    Sun, Tianshu
    SENSORS, 2022, 22 (08)
  • [32] Recent progress in transformer-based medical image analysis
    Liu, Zhaoshan
    Lv, Qiujie
    Yang, Ziduo
    Li, Yifan
    Lee, Chau Hung
    Shen, Lei
    COMPUTERS IN BIOLOGY AND MEDICINE, 2023, 164
  • [33] TBMF Framework: A Transformer-Based Multilevel Filtering Framework for PD Detection
    Xu, Ning
    Wang, Wensong
    Fulnecek, Jan
    Kabot, Ondrej
    Misak, Stanislav
    Wang, Lipo
    Zheng, Yuanjin
    Gooi, Hoay Beng
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2024, 71 (04) : 4098 - 4107
  • [34] A Novel Transformer-Based Attention Network for Image Dehazing
    Gao, Guanlei
    Cao, Jie
    Bao, Chun
    Hao, Qun
    Ma, Aoqi
    Li, Gang
    SENSORS, 2022, 22 (09)
  • [35] A Transformer-Based Network for Deformable Medical Image Registration
    Wang, Yibo
    Qian, Wen
    Li, Mengqi
    Zhang, Xuming
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT I, 2022, 13604 : 502 - 513
  • [36] Transformer-Based Distillation Hash Learning for Image Retrieval
    Lv, Yuanhai
    Wang, Chongyan
    Yuan, Wanteng
    Qian, Xiaohao
    Yang, Wujun
    Zhao, Wanqing
    ELECTRONICS, 2022, 11 (18)
  • [37] Enhancing Transformer-Based Table Structure Recognition for Long Tables
    Zhu, Ziyi
    Zhao, Wenqi
    Gao, Liangcai
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII, 2025, 15037 : 216 - 230
  • [38] Image2SMILES: Transformer-Based Molecular Optical Recognition Engine
    Khokhlov, Ivan
    Krasnov, Lev
    Fedorov, Maxim V.
    Sosnin, Sergey
CHEMISTRY-METHODS, 2022, 2 (01)
  • [40] Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning
    Zhang, Jing
    Fang, Zhongjun
    Wang, Zhe
    APPLIED INTELLIGENCE, 2023, 53 (11) : 13398 - 13414