Dual Stream Relation Learning Network for Image-Text Retrieval

Cited by: 0
Authors
Wu, Dongqing [1]
Li, Huihui [1]
Gu, Cang [1]
Guo, Lei [1]
Liu, Hang [2]
Affiliations
[1] Northwestern Polytech Univ, Sch Automat, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Sch Cybersecur, Xian 710072, Peoples R China
Keywords
Visualization; Semantics; Feature extraction; Cognition; Noise; Accuracy; Logic gates; Information filters; Encoding; Correlation; Image-text retrieval; region feature; grid feature; self-attention; CONTEXT
DOI
10.1109/TMM.2024.3521736
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Image-text retrieval has advanced considerably through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to reason accurately about complex intra-modal relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly address these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network in two respects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source and achieves a deeper and more complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, capturing comprehensive visual information to learn more precise inter-modal alignment. In addition, our method learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our model.
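The abstract names an adaptive gate fusion module but does not give its formulation. As a rough illustration of the general idea only, the sketch below fuses a region feature vector and a grid feature vector with an element-wise sigmoid gate. The function names (`sigmoid`, `gate_fuse`), the gate parameterization over the concatenated inputs, and the convex-combination form are all assumptions made for this example; they are not taken from the paper and may differ from the DIF module.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_fuse(region, grid, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Hypothetical formulation (an assumption, not the paper's definition):
        gate_i  = sigmoid(weights[i] . [region; grid] + bias[i])
        fused_i = gate_i * region_i + (1 - gate_i) * grid_i
    """
    if len(region) != len(grid):
        raise ValueError("region and grid features must have the same length")
    concat = list(region) + list(grid)
    fused = []
    for i, (r, g) in enumerate(zip(region, grid)):
        # One gate value per output dimension, computed from both inputs.
        logit = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(logit)
        fused.append(gate * r + (1.0 - gate) * g)
    return fused


# Toy inputs: with zero weights and zero bias every gate equals 0.5,
# so the fused vector is the element-wise mean of the two inputs.
region_feat = [1.0, 2.0]
grid_feat = [3.0, 4.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
mean_fused = gate_fuse(region_feat, grid_feat, zero_weights, [0.0, 0.0])
```

In a trained model the weights and bias would be learned, so the gate could favor region features for some dimensions and grid features for others. The toy call above only exercises the arithmetic.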
Pages: 1551-1565
Number of pages: 15