Dual-visual collaborative enhanced transformer for image captioning

Cited: 0
Authors
Mou, Zhenping [1 ]
Song, Tianqi [2 ]
Luo, Hong [3 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Comp Sci & Technol, Chongqing 400065, Peoples R China
[2] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[3] China Mobile Hangzhou Informat Technol Co Ltd, Intelligent Connected Prod Dept, Hangzhou 311100, Zhejiang, Peoples R China
Keywords
Image captioning; Transformer; Collaborative; Graph aggregation
DOI
10.1007/s00530-025-01775-9
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Transformer-based models have achieved significant performance improvements in image captioning. However, the field still faces two main problems: the lack of contextual visual information, and the difficulty of effectively integrating two different types of visual features. We therefore propose a Dual-Visual Collaborative Enhanced Transformer (DVCET) that makes full use of both grid and region visual features to improve captioning performance. Specifically, we design a Grid Aggregation Encoding Layer (GAEL) that integrates adjacent contextual information into each grid feature, helping the model capture both local and global context in the image. We then design a Region Graph Memory Encoding Layer (RGMEL) that reads and writes visual region representations through a visual graph memory to achieve object-level relational reasoning. Finally, we introduce Dual Gated Collaboration (DGC), which exploits the multi-level characteristics of the two feature types through a dual gating operation and reduces the information redundancy of the grid structure. Experimental results on MSCOCO and Flickr30K show that the proposed model improves caption quality, achieves strong results on multiple evaluation metrics, and is competitive with state-of-the-art methods.
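The abstract does not give the exact form of the Dual Gated Collaboration module, but the general idea it describes, weighting the grid and region streams with two learned gates before fusing them, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function name `dual_gated_fusion` and the gate parameters `w_g` and `w_r` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_gated_fusion(grid_feats, region_feats, w_g, w_r):
    """Fuse grid and region features with two sigmoid gates (hypothetical sketch).

    grid_feats:   (n, d) grid-level visual features
    region_feats: (n, d) region-level visual features
    w_g, w_r:     (2d,) gate weights, one vector per stream (assumed parameters)
    """
    # Each gate sees both streams, so the two feature types can
    # modulate each other and redundant grid information is down-weighted.
    joint = np.concatenate([grid_feats, region_feats], axis=-1)  # (n, 2d)
    g_grid = sigmoid(joint @ w_g)[:, None]    # scalar gate per grid feature
    g_region = sigmoid(joint @ w_r)[:, None]  # scalar gate per region feature
    return g_grid * grid_feats + g_region * region_feats
```

With zero gate weights both sigmoids evaluate to 0.5, so the fusion degenerates to a plain average of the two streams; learned weights let the model shift that balance per feature.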
Pages: 12