Exploring region relationships implicitly: Image captioning with visual relationship attention

Cited: 31
Authors
Zhang, Zongjian [1 ]
Wu, Qiang [1 ]
Wang, Yang [1 ]
Chen, Fang [1 ]
Affiliations
[1] Univ Technol Sydney, 15 Broadway, Sydney, NSW, Australia
Keywords
Image captioning; Visual relationship attention; Relationship-level attention; Parallel attention mechanism; Learned spatial constraint
DOI
10.1016/j.imavis.2021.104146
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual attention mechanisms have been widely used in image captioning models to dynamically attend to related visual regions conditioned on the language information generated so far. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships/interactions between visual regions, nor do they analyze or exploit alignment for the related words/phrases (e.g., verbs or phrasal verbs) that may best describe those relationships/interactions. As a result, current image captioning models can produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly addressed by existing visual attention mechanisms, this paper proposes a novel visual relationship attention based on contextualized embeddings of individual regions. It can dynamically discover the visual relationship that holds between multiple regions when generating interaction words. This relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, in order to map visual relationship information more precisely to the semantic description of that relationship in language. Unlike existing methods for exploring visual relationships, it is trained implicitly in an unsupervised manner, without any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention effectively boosts captioning performance by capturing the related visual relationships needed to generate accurate interaction descriptions. (c) 2021 Elsevier B.V. All rights reserved.
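The abstract describes the mechanism only in prose, so a brief sketch may help make the idea concrete. Below is a minimal PyTorch sketch of a relationship-level attention that scores region pairs in two parallel branches (a subject branch and an object branch) under a learned spatial bias, conditioned on the decoder state. Everything here is an assumption for illustration: the class name VisualRelationshipAttention, the pairwise geometry features geo, and all layer shapes are hypothetical and do not reproduce the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualRelationshipAttention(nn.Module):
    """Illustrative sketch only: relationship-level attention over region
    pairs with a learned spatial bias, conditioned on the decoder state."""

    def __init__(self, d, g, hidden=512):
        super().__init__()
        # Two parallel branches score each region as the "subject" or the
        # "object" of a candidate relationship, given the decoder state.
        self.subj_att = nn.Linear(2 * d, hidden)
        self.obj_att = nn.Linear(2 * d, hidden)
        self.subj_score = nn.Linear(hidden, 1)
        self.obj_score = nn.Linear(hidden, 1)
        # Learned spatial constraint: pairwise box geometry -> scalar bias.
        self.spatial = nn.Sequential(
            nn.Linear(g, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, V, h, geo):
        # V: (N, d) region features; h: (d,) decoder hidden state;
        # geo: (N, N, g) pairwise geometry (e.g. center offsets, size ratios).
        N, _ = V.shape
        hx = h.unsqueeze(0).expand(N, -1)                    # (N, d)
        x = torch.cat([V, hx], dim=-1)                       # (N, 2d)
        s = self.subj_score(torch.tanh(self.subj_att(x)))    # (N, 1)
        o = self.obj_score(torch.tanh(self.obj_att(x)))      # (N, 1)
        # Pair score (i, j) = subject score of i + object score of j
        # + learned spatial bias for the pair, then a joint softmax.
        pair = s + o.transpose(0, 1) + self.spatial(geo).squeeze(-1)  # (N, N)
        alpha = F.softmax(pair.view(-1), dim=0).view(N, N)
        # Marginal weights per subject / object region give two contexts.
        subj_ctx = (alpha.sum(dim=1, keepdim=True) * V).sum(dim=0)     # (d,)
        obj_ctx = (alpha.sum(dim=0, keepdim=True).t() * V).sum(dim=0)  # (d,)
        return torch.cat([subj_ctx, obj_ctx], dim=-1)                  # (2d,)

For instance, with 36 region features of dimension 2048 and 6-dimensional pairwise geometry (typical Faster R-CNN settings, assumed here), VisualRelationshipAttention(2048, 6)(V, h, geo) returns a 4096-dimensional relationship context that a decoder could consume alongside an ordinary region-attention context, mirroring the integration of the two attentions described in the abstract.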
Pages: 10
Related Papers
50 items in total
  • [21] Local-global visual interaction attention for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    DIGITAL SIGNAL PROCESSING, 2022, 130
  • [22] Image Captioning using Visual Attention and Detection Transformer Model
    Eluri, Yaswanth
    Vinutha, N.
    Jeevika, M.
    Sree, Sai Bhavya N.
    Abhiram, G. Surya
    10TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES, CONECCT 2024, 2024
  • [23] Image Captioning Based on Visual Relevance and Context Dual Attention
    Liu, M.-F.
    Shi, Q.
    Nie, L.-Q.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (09)
  • [24] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [26] Exploring Semantic Relationships for Image Captioning without Parallel Data
    Liu, Fenglin
    Gao, Meng
    Zhang, Tianhao
    Zou, Yuexian
    2019 19TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2019), 2019, : 439 - 448
  • [27] Guiding Attention using Partial-Order Relationships for Image Captioning
    Popattia, Murad
    Rafi, Muhammad
    Qureshi, Rizwan
    Nawaz, Shah
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4670 - 4679
  • [28] VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning
    Zhang, Zhengyuan
    Zhang, Wenkai
    Diao, Wenhui
    Yan, Menglong
    Gao, Xin
    Sun, Xian
    IEEE ACCESS, 2019, 7 : 137355 - 137364
  • [29] Modeling visual and word-conditional semantic attention for image captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Su, Fei
    Wang, Leiquan
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2018, 67 : 100 - 107
  • [30] Boosting convolutional image captioning with semantic content and visual relationship
    Bai, Cong
    Zheng, Anqi
    Huang, Yuan
    Pan, Xiang
    Chen, Nan
    DISPLAYS, 2021, 70