Context-assisted Transformer for Image Captioning

Cited by: 0
Authors
Lian Z. [1 ,2 ]
Wang R. [2 ]
Li H.-C. [2 ]
Yao H. [2 ]
Hu X.-H. [2 ]
Affiliations
[1] University of Chinese Academy of Sciences, Beijing
[2] Science & Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing
Source
Zidonghua Xuebao/Acta Automatica Sinica | 2023, Vol. 49, No. 9
Funding
National Natural Science Foundation of China
Keywords
attention mechanism; image captioning; Transformer; visual coherence
DOI
10.16383/j.aas.c220767
Abstract
The cross attention mechanism has made significant progress in modeling the relationship between semantic queries and image regions in image captioning. However, its visual coherence remains to be explored. To fill this gap, we propose a novel context-assisted cross attention (CACA) mechanism. With the help of historical context memory (HCM), CACA fully considers the potential impact of previously attended visual cues on the generation of current attention context. Moreover, we present a regularization method, called adaptive weight constraint (AWC), to restrict the total weight assigned to the historical contexts of each CACA module. We apply CACA and AWC to the Transformer model and construct a context-assisted transformer (CAT) for image captioning. Experimental results on the MS COCO (microsoft common objects in context) dataset demonstrate that our method achieves consistent improvement over the current state-of-the-art methods. © 2023 Science Press. All rights reserved.
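To make the idea concrete, the sketch below illustrates one plausible reading of context-assisted cross attention: at each decoding step, the semantic query attends over both the image region features and a memory of attention contexts produced at earlier steps, while a regularizer limits the total weight placed on that history. This is a minimal illustration only; the class ContextAssistedCrossAttention, the awc_penalty helper, the gating-by-concatenation scheme, and the budget parameter are assumptions for exposition, not the authors' implementation.

# Minimal sketch of the CACA idea (assumed formulation, not the paper's code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAssistedCrossAttention(nn.Module):
    """Cross attention over image regions, augmented with a memory of
    previously attended contexts (a stand-in for the paper's HCM)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, query, regions, history):
        # query:   (B, 1, d)  current semantic query (one decoding step)
        # regions: (B, R, d)  image region features
        # history: (B, T, d)  attention contexts from earlier steps (T may be 0)
        q = self.q_proj(query)
        # Append historical contexts to the visual keys/values so the current
        # step can reuse cues it has already attended to.
        kv_source = torch.cat([regions, history], dim=1) if history.size(1) > 0 else regions
        k, v = self.k_proj(kv_source), self.v_proj(kv_source)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)  # (B, 1, R+T)
        context = attn @ v
        # Total attention mass assigned to historical entries; a weight
        # constraint can regularize this so history does not dominate.
        hist_weight = attn[..., regions.size(1):].sum(dim=-1)  # (B, 1)
        return context, hist_weight

def awc_penalty(hist_weight, budget: float = 0.5):
    """Illustrative regularizer: penalize historical attention mass above a budget."""
    return F.relu(hist_weight - budget).mean()

if __name__ == "__main__":
    B, R, T, d = 2, 36, 4, 512
    caca = ContextAssistedCrossAttention(d)
    query, regions = torch.randn(B, 1, d), torch.randn(B, R, d)
    history = torch.randn(B, T, d)  # contexts accumulated from earlier steps
    context, hist_w = caca(query, regions, history)
    print(context.shape, hist_w.shape, float(awc_penalty(hist_w)))

In an actual captioning decoder, the returned context would be appended to the memory after each step and the penalty added to the training loss; the exact memory update and constraint used in the paper may differ from this sketch.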
Pages: 1889-1903
Page count: 14