Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10

Authors:

Khademi, Mahmoud [1]

Schulte, Oliver [1]

Affiliations:

[1] Simon Fraser Univ, Burnaby, BC, Canada

Source:

PROCEEDINGS 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW) | 2018

DOI:

10.1109/CVPRW.2018.00260

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a novel context-aware attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context by selecting or ignoring its input. The Grid LSTM has not been applied to the image caption generation task before. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects in an image and the relationships between them. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer uses a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
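
For context, the soft attention baseline that the abstract contrasts against can be sketched as follows. This is a generic illustration and not the paper's dynamic spatial attention: the dot-product scoring, the function names, and the toy 2x2 grid of region features are all assumptions made for the example. The point it shows is that each region is scored independently of its neighbours, so the grid layout plays no role in the weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, query):
    """Generic soft attention (assumed dot-product scoring).

    region_features: list of feature vectors, one per image region.
    query: vector standing in for the decoder state (hypothetical).
    Returns (weights, context), where context is the weighted average
    of the region features.
    """
    # Each region is scored on its own; neighbouring regions are ignored.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy 2x2 grid of regions, flattened row by row (illustrative values only).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights, context = soft_attention(regions, query=[2.0, 0.0])
```

Shuffling the order of `regions` only permutes the weights, which is one way to see that this baseline carries no notion of where a region sits in the image.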

Pages: 2024-2032

Page count: 9
