Context-assisted Transformer for Image Captioning

Cited by: 0
Authors
Lian Z. [1 ,2 ]
Wang R. [2 ]
Li H.-C. [2 ]
Yao H. [2 ]
Hu X.-H. [2 ]
Affiliations
[1] University of Chinese Academy of Sciences, Beijing
[2] Science & Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing
Source
Zidonghua Xuebao/Acta Automatica Sinica | 2023 / Vol. 49 / No. 09
Funding
National Natural Science Foundation of China;
Keywords
attention mechanism; image captioning; transformer; visual coherence;
DOI
10.16383/j.aas.c220767
CLC Number
Subject Classification Code
Abstract
The cross attention mechanism has made significant progress in modeling the relationship between semantic queries and image regions in image captioning. However, its visual coherence remains to be explored. To fill this gap, we propose a novel context-assisted cross attention (CACA) mechanism. With the help of a historical context memory (HCM), CACA fully considers the potential impact of previously attended visual cues on the generation of the current attention context. Moreover, we present a regularization method, called adaptive weight constraint (AWC), to restrict the total weight assigned to the historical contexts of each CACA module. We apply CACA and AWC to the Transformer model and construct a context-assisted Transformer (CAT) for image captioning. Experimental results on the MS COCO (Microsoft Common Objects in Context) dataset demonstrate that our method achieves consistent improvements over current state-of-the-art methods. © 2023 Science Press. All rights reserved.
Pages: 1889-1903
Page count: 14
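Based only on the mechanism described in the abstract, the sketch below illustrates one plausible way a context-assisted cross attention layer could combine ordinary query-to-region attention with attention over previously produced attention contexts, while capping the total weight given to history in the spirit of the adaptive weight constraint. It is a minimal PyTorch sketch under stated assumptions: the class name ContextAssistedCrossAttention, the history tensor, the gate, and the max_weight cap are illustrative choices, not the authors' actual formulation.

# Hypothetical sketch of context-assisted cross attention (CACA),
# written from the abstract's description only; names and details are assumed.
import torch
import torch.nn as nn

class ContextAssistedCrossAttention(nn.Module):
    """Cross attention whose output also attends over previously produced
    attention contexts (a simple stand-in for the historical context memory)."""

    def __init__(self, d_model: int, n_heads: int, max_weight: float = 0.5):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hist_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Upper bound on the total weight given to historical contexts,
        # loosely mimicking the adaptive weight constraint (AWC) in spirit.
        self.max_weight = max_weight
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, query, visual_feats, history=None):
        # Standard cross attention between word queries and image regions.
        ctx, _ = self.cross_attn(query, visual_feats, visual_feats)
        if history is None or history.size(1) == 0:
            return ctx, ctx  # no history yet: current context starts the memory
        # Attend over previously generated attention contexts.
        hist_ctx, _ = self.hist_attn(query, history, history)
        # Gate in [0, max_weight] controls how much history contributes.
        g = torch.sigmoid(self.gate(torch.cat([ctx, hist_ctx], dim=-1))) * self.max_weight
        out = (1.0 - g) * ctx + g * hist_ctx
        # Append the current context to the running history memory.
        new_history = torch.cat([history, ctx], dim=1)
        return out, new_history

if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ContextAssistedCrossAttention(d_model=64, n_heads=4)
    words = torch.randn(2, 1, 64)       # one decoding step, batch of 2
    regions = torch.randn(2, 36, 64)    # 36 image-region features
    history = torch.zeros(2, 0, 64)     # empty historical context memory
    out, history = layer(words, regions, history)
    print(out.shape, history.shape)     # torch.Size([2, 1, 64]) torch.Size([2, 1, 64])

In a full captioning decoder, a layer of this kind would be called once per generated word, carrying the history tensor forward so that earlier attention contexts can inform later ones.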