Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment

Cited by: 2
Authors
Zhuang, Jiamin [1 ,2 ]
Yu, Jing [1 ,2 ]
Ding, Yang [1 ,2 ]
Qu, Xiangyan [1 ,2 ]
Hu, Yue [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100085, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 101408, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Semantics; Image coding; Training; Encoding; Computational modeling; Costs; Fast image-text retrieval; concept-level cross-modal alignment; context-level cross-modal alignment; self-supervised learning;
DOI
10.1109/TMM.2023.3280734
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that enforce image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training and maintains independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pre-training independent-embedding models by 9.1%, 4.2%, and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
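As a reading aid for the terms used in the abstract, the following is a minimal sketch of independent-embedding retrieval, a generic symmetric contrastive (InfoNCE-style) loss, and the R@sum metric (the sum of R@1, R@5, and R@10 over both retrieval directions). It is not the authors' SelfAlign implementation; see the repository linked above for that. All function names, the temperature value, and the toy embeddings are hypothetical, and the sketch assumes one matching text per image (the pair at index i), whereas Flickr30K and MS-COCO pair each image with five captions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    The two sets are encoded independently, so the gallery side can be
    embedded offline and retrieval reduces to dot products.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def transpose(sim):
    return [list(col) for col in zip(*sim)]


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE loss; assumes sim[i][i] is the matched pair.

    The temperature is an illustrative value, not one taken from the paper.
    """
    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / len(rows)

    return 0.5 * (one_direction(sim) + one_direction(transpose(sim)))


def recall_at_k(sim, k):
    """Percentage of queries (rows) whose matched item ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        rank = sum(1 for s in row if s > row[i])  # items scored strictly higher
        hits += rank < k
    return 100.0 * hits / len(sim)


def r_sum(sim, ks=(1, 5, 10)):
    """Sum of R@k over image-to-text (rows) and text-to-image (columns)."""
    sim_t = transpose(sim)
    return sum(recall_at_k(sim, k) + recall_at_k(sim_t, k) for k in ks)


# Toy data: three image embeddings and their three matching text embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
sim = similarity_matrix(image_embs, text_embs)
print(recall_at_k(sim, 1))  # 100.0: every image ranks its own text first
print(r_sum(sim))           # 600.0: the maximum possible R@sum
print(info_nce(sim))
```

Because each modality is encoded on its own, the similarity matrix is the only cross-modal computation at query time, which is why this family of models is fast compared with interactive-embedding models that run a joint network per image-text pair.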
Pages: 1361 - 1372
Page count: 12
Related papers
50 in total
  • [1] Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment
    Zhuang, Jiamin
    Yu, Jing
    Ding, Yang
    Qu, Xiangyan
    Hu, Yue
    arXiv, 2023,
  • [2] ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
    Messina, Nicola
    Stefanini, Matteo
    Cornia, Marcella
    Baraldi, Lorenzo
    Falchi, Fabrizio
    Amato, Giuseppe
    Cucchiara, Rita
    19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 64 - 70
  • [3] Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval
    Li, Jiangtong
    Liu, Liu
    Niu, Li
    Zhang, Liqing
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 (30) : 9193 - 9207
  • [4] Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation
    Lei, Sen
    Xiao, Xinyu
    Zhang, Tianlin
    Li, Heng-Chao
    Shi, Zhenwei
    Zhu, Qing
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [5] Fine-Grained Image-Text Retrieval via Discriminative Latent Space Learning
    Zheng, Min
    Wang, Wen
    Li, Qingyong
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 (28) : 643 - 647
  • [6] Fine-Grained Object Classification via Self-Supervised Pose Alignment
    Yang, Xuhui
    Wang, Yaowei
    Chen, Ke
    Xu, Yong
    Tian, Yonghong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 7389 - 7398
  • [7] Fine-grained Feature Assisted Cross-modal Image-text Retrieval
    Bu, Chaofei
    Liu, Xueliang
    Huang, Zhen
    Su, Yuling
    Tu, Junfeng
    Hong, Richang
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 306 - 320
  • [8] Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting
    Chen, Wenting
    Wang, Pengyu
    Ren, Hui
    Sun, Lichao
    Li, Quanzheng
    Yuan, Yixuan
    Li, Xiang
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 240 - 250
  • [9] Fine-Grained Self-Supervised Learning with Jigsaw puzzles for medical image classification
    Park W.
    Ryu J.
    Comput. Biol. Med., 2024,
  • [10] Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval
    Zhou, Zihui
    Feng, Yong
    Qiu, Agen
    Duan, Guofan
    Zhou, Mingliang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 19194 - 19210