FB-Net: Dual-Branch Foreground-Background Fusion Network With Multi-Scale Semantic Scanning for Image-Text Retrieval

Cited by: 1
Authors
Xu, Junhao [1 ]
Liu, Zheng [1 ,2 ]
Pei, Xinlei [1 ,2 ]
Wang, Shuhuai [1 ]
Gao, Shanshan [1 ,2 ,3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Shandong China US Digital Media Int Cooperat Res C, Jinan 250014, Shandong, Peoples R China
Keywords
Semantics; information retrieval; task analysis; object recognition; media; text processing; distributed databases; image processing; image-text retrieval; foreground-background fusion; multi-scale semantic scanning; semantic unit; overlapped sliding window
DOI
10.1109/ACCESS.2023.3263512
CLC number
TP [automation technology; computer technology]
Subject classification code
0812
Abstract
As a fundamental branch of cross-modal retrieval, image-text retrieval remains a challenging problem, largely because of the complementary and imbalanced relationship between the two modalities. In particular, existing works have not effectively scanned and aligned the semantic units distributed across different granularities of images and texts. To address these issues, we propose a dual-branch foreground-background fusion network (FB-Net), which fully explores and fuses the complementarity among semantic units collected from the foreground and background areas of instances (e.g., images and texts). First, to generate multi-granularity semantic units from images and texts, multi-scale semantic scanning is conducted on both foreground and background areas through multi-level overlapped sliding windows. Second, to align semantic units between images and texts, a stacked cross-attention mechanism is used to compute the initial image-text similarity. Third, to further optimize this similarity adaptively, a dynamically self-adaptive weighted loss is designed. Finally, to perform retrieval, the similarities between multi-granularity foreground and background semantic units are fused to obtain the final image-text similarity. Experimental results show that FB-Net outperforms representative state-of-the-art methods for image-text retrieval, and ablation studies further verify the effectiveness of each component of FB-Net.
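The multi-level overlapped sliding-window scanning described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `multi_scale_scan`, the particular window sizes, the stride of 1, and mean-pooling as the aggregation step are all assumptions made for the sketch.

```python
import numpy as np

def multi_scale_scan(features, window_sizes=(2, 3, 4), stride=1):
    """Group a sequence of local features (e.g., image-region or word
    embeddings) into multi-granularity "semantic units" by mean-pooling
    over overlapped sliding windows of several sizes.

    features: (n, d) array of n local features of dimension d.
    Returns a list of (m_k, d) arrays, one per window size
    (window sizes larger than n are skipped).
    """
    n, _ = features.shape
    units = []
    for w in window_sizes:
        if w > n:
            continue
        # Overlapped windows: with stride < w, consecutive windows share
        # features, so each local feature contributes to several units.
        starts = range(0, n - w + 1, stride)
        pooled = np.stack([features[s:s + w].mean(axis=0) for s in starts])
        units.append(pooled)
    return units
```

Each returned array holds one granularity level; in the paper's pipeline, units at every level from both foreground and background branches would then be matched across modalities (e.g., via cross-attention) before their similarities are fused.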
Pages: 36516-36537 (22 pages)