Multi-level network based on transformer encoder for fine-grained image-text matching

Times Cited: 3
Authors
Yang, Lei [1 ]
Feng, Yong [1 ]
Zhou, Mingliang [1 ]
Xiong, Xiancai [2 ,3 ]
Wang, Yongheng [4 ]
Qiang, Baohua [5 ]
Affiliations
[1] Chongqing Univ, Coll Comp Sci, Chongqing 400044, Peoples R China
[2] Minist Nat Resources, Key Lab Monitoring Evaluat & Early Warning Terr Sp, Chongqing 401147, Peoples R China
[3] Chongqing Inst Planning & Nat Resources Invest & M, Chongqing 401121, Peoples R China
[4] Zhejiang Lab, Hangzhou 311121, Peoples R China
[5] Guilin Univ Elect Technol, Guangxi Key Lab Trusted Software, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image-text matching; Multi-level network; Transformer encoder; Fine-grained information;
DOI
10.1007/s00530-023-01079-w
Chinese Library Classification (CLC) Code
TP [Automation and Computer Technology];
Discipline Classification Code
0812;
Abstract
Image-text matching is important for understanding both vision and language. Existing methods utilize the cross-attention mechanism to explore deep semantic information. However, the majority of these methods must perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which can reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image-text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregating method, making the alignment more efficient while fully exploiting intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representations more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representations. Experimental results show significant improvements in both retrieval performance and runtime compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
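For illustration only, the following is a minimal PyTorch sketch of the transformer-encoder alignment branch the abstract describes: each modality's local features pass through their own transformer encoder to model intra-modality relations, are pooled into a single embedding, and the pair is trained with a ranking objective. This is not the authors' released implementation (see the MNTE repository linked above); the feature dimensions, mean pooling as the "efficient aggregating method", and the VSE++-style triplet loss are all assumptions, and the digital-information and global-information branches are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalEncoder(nn.Module):
    """Transformer encoder that models intra-modality relations among
    region (or word) features, then aggregates them into one vector.
    Mean pooling is an assumed stand-in for the paper's aggregation."""
    def __init__(self, in_dim, embed_dim=512, heads=8, layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats):                   # feats: (B, N, in_dim)
        h = self.encoder(self.proj(feats))      # (B, N, embed_dim)
        pooled = h.mean(dim=1)                  # simple aggregation
        return F.normalize(pooled, dim=-1)      # unit-norm embedding

def triplet_loss(img, txt, margin=0.2):
    """Hinge-based triplet ranking loss with in-batch negatives
    (VSE++-style), a common objective for image-text matching."""
    sim = img @ txt.t()                         # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)               # matched-pair similarities
    cost_im = (margin + sim - pos).clamp(min=0)      # image as anchor
    cost_tx = (margin + sim - pos.t()).clamp(min=0)  # text as anchor
    mask = torch.eye(sim.size(0), dtype=torch.bool)  # ignore positives
    return cost_im.masked_fill(mask, 0).sum() + cost_tx.masked_fill(mask, 0).sum()

# Toy usage with assumed shapes: 8 pairs, 36 image regions x 2048-d
# (e.g., detector features) and 20 words x 300-d word embeddings.
img_enc, txt_enc = IntraModalEncoder(2048), IntraModalEncoder(300)
img = img_enc(torch.randn(8, 36, 2048))
txt = txt_enc(torch.randn(8, 20, 300))
loss = triplet_loss(img, txt)

Because each modality is encoded and pooled independently, similarities reduce to a single matrix product over precomputable embeddings, which is what makes this style of alignment faster than cross-attention methods that must re-run attention for every image-text pair.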
Pages: 1981-1994
Page count: 14