Multi-Modal Graph Aggregation Transformer for image captioning

Cited by: 2
Authors
Chen, Lizhi [1]
Li, Kesen [2]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215000, Peoples R China
[2] Zhejiang A&F Univ, Coll Engn Technol, Jiyang Coll, Zhuji 311899, Peoples R China
Keywords
Image captioning; Transformer; Multi-Modal; Graph Aggregation
DOI
10.1016/j.neunet.2024.106813
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current image captioning methods directly encode detected target regions and recognize the objects in an image in order to describe it. However, relying solely on regional features is unreliable because they cannot convey contextual information such as the relationships between objects, and they lack object-predicate-level semantics. An effective model should incorporate multiple modalities and explore their interactions to help understand the image. We therefore introduce the Multi-Modal Graph Aggregation Transformer (MMGAT), which uses information from several image modalities to fill this gap. It first represents an image as a graph consisting of three subgraphs, which depict the context grid, region, and semantic text modalities respectively. We then introduce three aggregators that guide message passing from one graph to another to exploit context across modalities and thereby refine the node features. The updated nodes provide better features for image captioning. MMGAT achieves CIDEr scores of 144.6% on MS-COCO and 80.3% on Flickr30k, a significant result compared with state-of-the-art methods, and we conduct a rigorous analysis to demonstrate the importance of each part of our design.
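The abstract does not specify how the aggregators are built, so the following is only a minimal sketch of the general idea of passing messages from one subgraph to another. It assumes a single-head cross-attention update with a residual connection; the function `aggregate`, the node counts, the feature size `d`, and the choice of which graph feeds which are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(target, source, w_q, w_k, w_v):
    """Refine target-graph nodes with messages from source-graph nodes.

    Assumed form: scaled dot-product cross-attention plus a residual.
    This is an illustrative stand-in, not the paper's aggregator.
    """
    q = target @ w_q                      # queries from the target graph
    k = source @ w_k                      # keys from the source graph
    v = source @ w_v                      # values (messages) from the source graph
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_target, n_source)
    return target + attn @ v, attn

rng = np.random.default_rng(0)
d = 16                                    # hypothetical feature size

# Hypothetical node features for the three subgraphs.
grid = rng.normal(size=(49, d))           # context grid nodes
region = rng.normal(size=(10, d))         # detected region nodes
text = rng.normal(size=(5, d))            # semantic text nodes

def make_weights():
    # Random projections stand in for learned parameters.
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

# Three aggregation steps (the direction of each is an assumption).
region_refined, attn_gr = aggregate(region, grid, *make_weights())
region_refined, attn_tr = aggregate(region_refined, text, *make_weights())
grid_refined, attn_rg = aggregate(grid, region_refined, *make_weights())

print(region_refined.shape, grid_refined.shape)
```

Each call leaves the number of target nodes unchanged and only updates their features, which matches the abstract's description of refining nodes with context from another modality.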
Pages: 10