Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited: 0
Authors
Su, Jindian [1 ]
Mou, Yueqi [1 ]
Xie, Yunhao [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024, Vol. 14875
Keywords
Image Captioning; Two-Pass Decoding; Transformer;
DOI
10.1007/978-981-97-5663-6_15
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models. This practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
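The two-pass idea in the abstract can be sketched in miniature: a Draft Decoder first produces a caption conditioned only on image features, then a Deliberation Decoder re-decodes while attending to both the image features and the draft caption's embeddings (a toy stand-in for the CMF module). All tensors, names, and the greedy decoding loop below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 8, 4

# Stand-ins for frozen pre-trained components: visual encoder output,
# an embedding table shared by both decoders, and an output projection.
img_feats = rng.normal(size=(3, D))      # 3 image-region features
embed = rng.normal(size=(VOCAB, D))
W_out = rng.normal(size=(D, VOCAB))

def attend(query, keys):
    """Single-head dot-product attention pooling over `keys`."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def decode(context_fn, steps=5):
    """Greedy decoding: pool a context vector, emit the argmax token."""
    tokens, state = [], np.zeros(D)
    for _ in range(steps):
        ctx = context_fn(state)
        tok = int(np.argmax(ctx @ W_out))
        tokens.append(tok)
        state = 0.5 * state + 0.5 * embed[tok]  # toy recurrent state
    return tokens

# Pass 1: the Draft Decoder attends only to image features.
draft = decode(lambda s: attend(s, img_feats))

# Pass 2: the Deliberation Decoder attends to image features fused
# with the draft caption's embeddings (toy cross-modality fusion).
fused_keys = np.concatenate([img_feats, embed[draft]], axis=0)
refined = decode(lambda s: attend(s, fused_keys))

print("draft:  ", draft)
print("refined:", refined)
```

In the paper's framework the Draft Decoder is taken from a pre-trained single-pass model and kept frozen, so only the deliberation-side parameters (here, conceptually, the fusion over `fused_keys`) would be updated during training.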
Pages: 171-183
Page count: 13
Related Papers
50 records
  • [31] A Transformer-Based Bridge Structural Response Prediction Framework
    Li, Ziqi
    Li, Dongsheng
    Sun, Tianshu
    SENSORS, 2022, 22 (08)
  • [32] Recent progress in transformer-based medical image analysis
    Liu, Zhaoshan
    Lv, Qiujie
    Yang, Ziduo
    Li, Yifan
    Lee, Chau Hung
    Shen, Lei
    COMPUTERS IN BIOLOGY AND MEDICINE, 2023, 164
  • [33] TBMF Framework: A Transformer-Based Multilevel Filtering Framework for PD Detection
    Xu, Ning
    Wang, Wensong
    Fulnecek, Jan
    Kabot, Ondrej
    Misak, Stanislav
    Wang, Lipo
    Zheng, Yuanjin
    Gooi, Hoay Beng
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2024, 71 (04) : 4098 - 4107
  • [34] A Novel Transformer-Based Attention Network for Image Dehazing
    Gao, Guanlei
    Cao, Jie
    Bao, Chun
    Hao, Qun
    Ma, Aoqi
    Li, Gang
    SENSORS, 2022, 22 (09)
  • [35] A Transformer-Based Network for Deformable Medical Image Registration
    Wang, Yibo
    Qian, Wen
    Li, Mengqi
    Zhang, Xuming
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT I, 2022, 13604 : 502 - 513
  • [36] Transformer-Based Distillation Hash Learning for Image Retrieval
    Lv, Yuanhai
    Wang, Chongyan
    Yuan, Wanteng
    Qian, Xiaohao
    Yang, Wujun
    Zhao, Wanqing
    ELECTRONICS, 2022, 11 (18)
  • [37] Enhancing Transformer-Based Table Structure Recognition for Long Tables
    Zhu, Ziyi
    Zhao, Wenqi
    Gao, Liangcai
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII, 2025, 15037 : 216 - 230
  • [38] Image2SMILES: Transformer-Based Molecular Optical Recognition Engine
    Khokhlov, Ivan
    Krasnov, Lev
    Fedorov, Maxim V.
    Sosnin, Sergey
CHEMISTRY-METHODS, 2022, 2 (01)
  • [40] Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning
    Zhang, Jing
    Fang, Zhongjun
    Wang, Zhe
    APPLIED INTELLIGENCE, 2023, 53 (11) : 13398 - 13414