Learning Double-Level Relationship Networks for image captioning

Cited by: 8
Authors
Wang, Changzhi [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200438, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Local-global relationship; Relationship network; Graph attention network; ATTENTION;
DOI
10.1016/j.ipm.2023.103288
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
Image captioning aims to generate descriptive sentences that describe the main contents of an image. Existing attention-based approaches mainly focus on the salient visual features in the image. However, ignoring the relationship between local features and global features can cause local features to lose their interaction with global concepts, producing inappropriate or inaccurate relationship words/phrases in the generated sentences. To alleviate this issue, in this work we propose the Double-Level Relationship Networks (DLRN), which exploit the complementary local and global features in the image and enhance the relationships between features. Technically, DLRN builds two types of networks: a separate relationship network and a unified relationship embedding network. The former learns different hierarchies of visual relationship by performing graph attention for local-level and pixel-level relationship enhancement, respectively. The latter takes the global features as a guide to learn the local-global relationship between local regions and global concepts, and obtains a feature representation containing rich relationship information. Further, we devise an attention-based feature fusion module to fully exploit the contribution of each modality; it effectively fuses the previously obtained relationship features with the original region features. Extensive experiments on three typical datasets verify that DLRN significantly outperforms several state-of-the-art baselines. More remarkably, DLRN achieves this competitive performance while maintaining notable model efficiency. The source code is available on GitHub at https://github.com/RunCode90/ImageCaptioning.
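Two of the operations the abstract names, graph attention over detected region features and attention-based fusion of relationship features with the original region features, can be sketched in a few lines of PyTorch. The sketch below is an assumption-laden illustration, not DLRN itself: the module names (RegionGraphAttention, AttentionFusion), the single-head formulation, and the 36-region/2048-dimension setup are all hypothetical; the authors' actual implementation lives at the GitHub link above.

```python
# Illustrative sketch only -- all shapes and module designs are assumptions,
# not the DLRN implementation (see the GitHub link above for the real code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionGraphAttention(nn.Module):
    """Single-head graph attention over region features (hypothetical design).

    Treats the N detected regions as a fully connected graph and re-weights
    each region by its learned pairwise relationships to all regions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)     # shared projection
        self.score = nn.Linear(2 * dim, 1, bias=False)  # pairwise scoring

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (N, dim) local region features
        h = self.proj(regions)                          # (N, dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)            # (N, N, dim) rows
        hj = h.unsqueeze(0).expand(n, n, -1)            # (N, N, dim) cols
        e = F.leaky_relu(self.score(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                    # (N, N) relation weights
        return alpha @ h                                # relationship-enhanced features


class AttentionFusion(nn.Module):
    """Fuses relationship-enhanced and original region features with learned
    per-feature weights (again, an assumed formulation)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, rel: torch.Tensor, orig: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(torch.cat([rel, orig], dim=-1)), dim=-1)
        return w[..., :1] * rel + w[..., 1:] * orig


# Usage: 36 regions with 2048-d features, typical of Faster R-CNN detectors.
regions = torch.randn(36, 2048)
enhanced = RegionGraphAttention(2048)(regions)
fused = AttentionFusion(2048)(enhanced, regions)
print(fused.shape)  # torch.Size([36, 2048])
```

A fully connected region graph is the standard setup for graph attention in captioning models; the paper's pixel-level enhancement branch and the global-guided relationship embedding network are omitted here for brevity.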
Pages: 24
Related papers
50 records in total
  • [1] Triple-level relationship enhanced transformer for image captioning
    Zheng, Anqi
    Zheng, Shiqi
    Bai, Cong
    Chen, Deng
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 1955 - 1966
  • [2] Learning joint relationship attention network for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 211
  • [3] Multi-level Visual Fusion Networks for Image Captioning
    Zhou, Dongming
    Zhang, Canlong
    Li, Zhixin
    Wang, Zhiwen
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [4] Double-Stream Position Learning Transformer Network for Image Captioning
    Jiang, Weitao
    Zhou, Wei
    Hu, Haifeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (11) : 7706 - 7718
  • [5] Double awareness mechanism based deep learning framework for image captioning
    Gaurav
    Mathur, Pratistha
    JOURNAL OF DISCRETE MATHEMATICAL SCIENCES & CRYPTOGRAPHY, 2023, 26 (06): : 1801 - 1817
  • [6] Learning visual relationship and context-aware attention for image captioning
    Wang, Junbo
    Wang, Wei
    Wang, Liang
    Wang, Zhiyong
    Feng, David Dagan
    Tan, Tieniu
    PATTERN RECOGNITION, 2020, 98
  • [7] Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning
    Dong, Xinzhi
    Long, Chengjiang
    Xu, Wenju
    Xiao, Chunxia
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2615 - 2624
  • [8] Semantic Representations With Attention Networks for Boosting Image Captioning
    Hafeth, Deema Abdal
    Kollias, Stefanos
    Ghafoor, Mubeen
    IEEE ACCESS, 2023, 11 : 40230 - 40239
  • [9] IMAGE CAPTIONING WITH WORD LEVEL ATTENTION
    Fang, Fang
    Wang, Hanli
    Tang, Pengjie
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 1278 - 1282