Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Authors
Su, Jindian [1]
Mou, Yueqi [1]
Xie, Yunhao [2]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024 / Vol. 14875
Keywords
Image Captioning; Two-Pass Decoding; Transformer;
DOI
10.1007/978-981-97-5663-6_15
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch and therefore fail to fully leverage the knowledge already learned by pre-trained single-pass models, which increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective alignment and complementation of information between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we transfer foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset of parameters within the Deliberation Decoder, reducing cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to improve the quality of refinements in a targeted way. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
Pages: 171-183
Number of pages: 13
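
The control flow described in the abstract (encode the image, produce a draft caption, then refine it with a decoder that sees both the image and the draft) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy stand-in modules, and the parameter-name prefixes are hypothetical and are not taken from the paper. In particular, the abstract does not say which subset of the Deliberation Decoder is updated; the sketch arbitrarily treats the CMF parameters as the trainable subset.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]  # stand-in for the visual encoder output
Caption = List[str]     # a caption as a list of tokens


def two_pass_caption(
    image: Sequence[float],
    visual_encoder: Callable[[Sequence[float]], Features],
    draft_decoder: Callable[[Features], Caption],
    deliberation_decoder: Callable[[Features, Caption], Caption],
) -> Tuple[Caption, Caption]:
    """Run a first pass to get a draft, then a second pass that refines it."""
    features = visual_encoder(image)
    draft = draft_decoder(features)                # first pass
    final = deliberation_decoder(features, draft)  # second pass sees image and draft
    return draft, final


def select_trainable(param_names: Sequence[str], prefixes: Sequence[str]) -> List[str]:
    """Return the parameter names that stay trainable; all others are frozen."""
    return [n for n in param_names if any(n.startswith(p) for p in prefixes)]


# Toy stand-ins (hypothetical) so the sketch runs without a deep learning library.
def toy_encoder(image: Sequence[float]) -> Features:
    return [sum(image) / len(image)]


def toy_draft(features: Features) -> Caption:
    return ["a", "dog", "on", "grass"]


def toy_deliberate(features: Features, draft: Caption) -> Caption:
    final = list(draft)
    if features[0] > 0.5:  # pretend the image evidence supports an extra detail
        final.insert(1, "brown")
    return final


draft, final = two_pass_caption([0.9, 0.7], toy_encoder, toy_draft, toy_deliberate)

# Hypothetical parameter names; only the assumed CMF subset is left trainable.
names = [
    "visual_encoder.weight",
    "draft_decoder.self_attn.weight",
    "deliberation_decoder.self_attn.weight",
    "deliberation_decoder.cmf.weight",
]
trainable = select_trainable(names, ["deliberation_decoder.cmf."])
```

In a real model the frozen names would correspond to parameters inherited from the single-pass baseline, and only the selected subset would receive gradient updates.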