Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited by: 0

Authors:

Su, Jindian [1]

Mou, Yueqi [1]

Xie, Yunhao [2]

Affiliations:

[1] South China University of Technology, School of Computer Science and Engineering, Guangzhou, People's Republic of China

[2] South China University of Technology, School of Software Engineering, Guangzhou, People's Republic of China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024 / Vol. 14875

Keywords:

Image Captioning; Two-Pass Decoding; Transformer

DOI:

10.1007/978-981-97-5663-6_15

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models. This practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to improve the quality of refinements in a targeted way. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.

Pages: 171-183

Page count: 13
