FB-Net: Dual-Branch Foreground-Background Fusion Network With Multi-Scale Semantic Scanning for Image-Text Retrieval

Cited by: 1
Authors
Xu, Junhao [1 ]
Liu, Zheng [1 ,2 ]
Pei, Xinlei [1 ,2 ]
Wang, Shuhuai [1 ]
Gao, Shanshan [1 ,2 ,3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Shandong China US Digital Media Int Cooperat Res C, Jinan 250014, Shandong, Peoples R China
Keywords
Semantics; information retrieval; task analysis; object recognition; media; text processing; distributed databases; image processing; image-text retrieval; foreground-background fusion; multi-scale semantic scanning; semantic unit; overlapped sliding window
DOI
10.1109/ACCESS.2023.3263512
CLC number
TP [automation technology; computer technology]
Subject classification code
0812
Abstract
As a fundamental branch of cross-modal retrieval, image-text retrieval remains a challenging problem, largely because of the complementary and imbalanced relationship between the two modalities. In particular, existing works have not effectively scanned and aligned the semantic units distributed across different granularities of images and texts. To address these issues, we propose a dual-branch foreground-background fusion network (FB-Net), which fully explores and fuses the complementarity among semantic units collected from the foreground and background areas of instances (e.g., images and texts). First, to generate multi-granularity semantic units from images and texts, multi-scale semantic scanning is conducted on both foreground and background areas through multi-level overlapped sliding windows. Second, to align semantic units between images and texts, a stacked cross-attention mechanism is used to compute the initial image-text similarity. Third, to further optimize this similarity adaptively, a dynamically self-adaptive weighted loss is designed. Finally, to perform retrieval, the similarities between multi-granularity foreground and background semantic units are fused to obtain the final image-text similarity. Experimental results show that FB-Net outperforms representative state-of-the-art methods for image-text retrieval, and ablation studies further verify the effectiveness of each component of FB-Net.
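The multi-level overlapped sliding-window scanning described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `multi_scale_scan`, the particular window sizes, the stride of 1, and mean-pooling as the aggregation step are all assumptions made for the sketch.

```python
import numpy as np

def multi_scale_scan(features, window_sizes=(2, 3, 4), stride=1):
    """Group a sequence of local features (e.g., image-region or word
    embeddings) into multi-granularity "semantic units" by mean-pooling
    over overlapped sliding windows of several sizes.

    features: (n, d) array of n local features of dimension d.
    Returns a list of (m_k, d) arrays, one per window size
    (window sizes larger than n are skipped).
    """
    n, _ = features.shape
    units = []
    for w in window_sizes:
        if w > n:
            continue
        # Overlapped windows: with stride < w, consecutive windows share
        # features, so each local feature contributes to several units.
        starts = range(0, n - w + 1, stride)
        pooled = np.stack([features[s:s + w].mean(axis=0) for s in starts])
        units.append(pooled)
    return units
```

Each returned array holds one granularity level; in the paper's pipeline, units at every level from both foreground and background branches would then be matched across modalities (e.g., via cross-attention) before their similarities are fused.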
Pages: 36516-36537 (22 pages)