Exploring region relationships implicitly: Image captioning with visual relationship attention

Cited by: 31
Authors
Zhang, Zongjian [1 ]
Wu, Qiang [1 ]
Wang, Yang [1 ]
Chen, Fang [1 ]
Institution
[1] Univ Technol Sydney, 15 Broadway, Sydney, NSW, Australia
Keywords
Image captioning; Visual relationship attention; Relationship-level attention; Parallel attention mechanism; Learned spatial constraint
DOI
10.1016/j.imavis.2021.104146
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual attention mechanisms have been widely used in image captioning models to dynamically attend to related visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships/interactions between visual regions. Furthermore, they do not analyze or explore the alignment for relation words/phrases (e.g., verbs or phrasal verbs) that may best describe the relationships/interactions between these visual regions, which leads current image captioning models to produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly adopted by existing visual attention mechanisms, this paper proposes a novel visual relationship attention built on contextualized embeddings of individual regions. It can dynamically explore a related visual relationship among multiple regions when generating interaction words. This relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, so that visual relationship information is mapped more precisely to the semantic description of that relationship in language. Different from existing methods for exploring visual relationships, it is trained implicitly in an unsupervised manner, without any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Solid experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention can effectively boost captioning performance by capturing related visual relationships and generating accurate interaction descriptions. (c) 2021 Elsevier B.V. All rights reserved.
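To make the abstract's mechanism concrete, below is a minimal PyTorch-style sketch of a relationship-level attention branch running in parallel with conventional region-level attention, with pair scores biased by a spatial constraint learned from bounding-box geometry. This is not the authors' released code: all module names, feature dimensions, the pairwise feature construction, and the exact form of the spatial bias are illustrative assumptions.

    # Illustrative sketch only (not the paper's implementation).
    # Region branch: attend over N region features given the decoder state.
    # Relationship branch: attend over N*N ordered region pairs, with a
    # learned spatial-constraint bias computed from relative box geometry.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualRelationshipAttention(nn.Module):
        def __init__(self, feat_dim=2048, hid_dim=512, att_dim=512):
            super().__init__()
            self.v_proj = nn.Linear(feat_dim, att_dim)       # region branch
            self.p_proj = nn.Linear(2 * feat_dim, att_dim)   # pair (relationship) branch
            self.h_proj = nn.Linear(hid_dim, att_dim)        # decoder linguistic context
            self.v_score = nn.Linear(att_dim, 1)
            self.p_score = nn.Linear(att_dim, 1)
            # Learned spatial constraint: pairwise box geometry -> scalar bias.
            self.spatial = nn.Sequential(nn.Linear(4, att_dim), nn.ReLU(),
                                         nn.Linear(att_dim, 1))

        def forward(self, feats, boxes, h):
            # feats: (B, N, feat_dim) region features, e.g. from Faster R-CNN
            # boxes: (B, N, 4) normalized (cx, cy, w, h); w, h must be > 0
            # h:     (B, hid_dim) current decoder hidden state
            B, N, D = feats.shape
            ctx = self.h_proj(h).unsqueeze(1)                # (B, 1, att_dim)

            # Region-level attention (the conventional branch).
            v_att = torch.tanh(self.v_proj(feats) + ctx)     # (B, N, att_dim)
            v_w = F.softmax(self.v_score(v_att).squeeze(-1), dim=-1)
            region_ctx = torch.bmm(v_w.unsqueeze(1), feats).squeeze(1)

            # Relationship-level attention over ordered region pairs.
            fi = feats.unsqueeze(2).expand(B, N, N, D)       # "subject" features
            fj = feats.unsqueeze(1).expand(B, N, N, D)       # "object" features
            pair = torch.cat([fi, fj], dim=-1)               # (B, N, N, 2D)
            p_att = torch.tanh(self.p_proj(pair) + ctx.unsqueeze(1))
            p_logit = self.p_score(p_att).squeeze(-1)        # (B, N, N)

            # Spatial constraint from relative geometry (dx, dy, log size ratios).
            rel = torch.stack([
                boxes[:, :, None, 0] - boxes[:, None, :, 0],
                boxes[:, :, None, 1] - boxes[:, None, :, 1],
                torch.log(boxes[:, :, None, 2] / boxes[:, None, :, 2]),
                torch.log(boxes[:, :, None, 3] / boxes[:, None, :, 3]),
            ], dim=-1)                                       # (B, N, N, 4)
            p_logit = p_logit + self.spatial(rel).squeeze(-1)

            p_w = F.softmax(p_logit.view(B, -1), dim=-1).view(B, N, N)
            rel_ctx = torch.bmm(p_w.view(B, 1, -1),
                                pair.view(B, N * N, 2 * D)).squeeze(1)
            return region_ctx, rel_ctx

    # Example usage:
    # m = VisualRelationshipAttention()
    # feats = torch.randn(2, 36, 2048)
    # boxes = torch.rand(2, 36, 4) + 0.1   # keep w, h strictly positive
    # h = torch.randn(2, 512)
    # region_ctx, rel_ctx = m(feats, boxes, h)

In such a design, the decoder would receive both context vectors and could lean on the region-level one when emitting object words and on the relationship-level one when emitting interaction words (verbs or phrasal verbs), matching the unsupervised, annotation-free training described in the abstract.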
Pages: 10
Related Papers
50 records in total
  • [1] Visual Relationship Attention for Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [2] Exploring Visual Relationship for Image Captioning
    Yao, Ting
    Pan, Yingwei
    Li, Yehao
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218 : 711 - 727
  • [3] Social Image Captioning: Exploring Visual Attention and User Attention
    Wang, Leiquan
    Chu, Xiaoliang
    Zhang, Weishan
    Wei, Yiwei
    Sun, Weichen
    Wu, Chunlei
    SENSORS, 2018, 18 (02)
  • [4] Bengali Image Captioning with Visual Attention
    Ami, Amit Saha
    Humaira, Mayeesha
    Jim, Md Abidur Rahman Khan
    Paul, Shimul
    Shah, Faisal Muhammad
    2020 23RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2020), 2020,
  • [5] Learning visual relationship and context-aware attention for image captioning
    Wang, Junbo
    Wang, Wei
    Wang, Liang
    Wang, Zhiyong
    Feng, David Dagan
    Tan, Tieniu
    PATTERN RECOGNITION, 2020, 98
  • [6] Image Captioning Based on Visual and Semantic Attention
    Wei, Haiyang
    Li, Zhixin
    Zhang, Canlong
    MULTIMEDIA MODELING (MMM 2020), PT I, 2020, 11961 : 151 - 162
  • [7] Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning
    Li, Jingyu
    Mao, Zhendong
    Li, Hao
    Chen, Weidong
    Zhang, Yongdong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [8] Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning
    Yuan, Zhenghang
    Li, Xuelong
    Wang, Qi
IEEE ACCESS, 2020, 8 (08): 2608 - 2620
  • [9] Graph neural network-based visual relationship and multilevel attention for image captioning
    Sharma, Himanshu
    Srivastava, Swati
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (05)
  • [10] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
JOURNAL OF INTELLIGENT AND FUZZY SYSTEMS, 2024, 46 (02): 3447 - 3459