Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment

Cited by: 2

Authors:

Zhuang, Jiamin [1,2]

Yu, Jing [1,2]

Ding, Yang [1,2]

Qu, Xiangyan [1,2]

Hu, Yue [1,2]

Affiliations:

[1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, People's Republic of China

[2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, People's Republic of China

Source:

IEEE Transactions on Multimedia | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Semantics; Image coding; Training; Encoding; Computational modeling; Costs; Fast image-text retrieval; concept-level cross-modal alignment; context-level cross-modal alignment; self-supervised learning;

DOI:

10.1109/TMM.2023.3280734

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, both of which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose SelfAlign, an image-text alignment module built on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency, without extra supervision. SelfAlign contains two collaborative sub-modules that enforce image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training, and it keeps the image and text encoders independent during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pre-training independent-embedding models by 9.1%, 4.2%, and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models, with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
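The abstract reports accuracy as an R@sum score without defining it. As a rough illustration only, the sketch below assumes the convention common in image-text retrieval work: R@sum is the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions (image-to-text and text-to-image), with similarity computed directly on independently produced embeddings. The function names (`cosine`, `recall_at_k`, `r_sum`), the toy embeddings, and the one-caption-per-image pairing are assumptions made for this example; none of it is taken from the SelfAlign code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, gallery, k):
    """Percentage of queries whose paired item (same index) ranks in the top k.

    Assumption: query i matches gallery item i (one caption per image).
    """
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, g) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(queries)


def r_sum(image_embs, text_embs, ks=(1, 5, 10)):
    """Sum of Recall@k over both retrieval directions (assumed definition)."""
    i2t = sum(recall_at_k(image_embs, text_embs, k) for k in ks)
    t2i = sum(recall_at_k(text_embs, image_embs, k) for k in ks)
    return i2t + t2i


# Toy, hand-made embeddings; in an independent-embedding model these would
# come from separate image and text encoders and could be cached offline.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
print(r_sum(images, texts))  # 600.0 (every pair is ranked first in both directions)
```

Because each modality is embedded on its own, the gallery embeddings can be precomputed once and a query costs only one encoder pass plus a similarity lookup. That is the efficiency property the abstract contrasts with interactive-embedding models, which must run a joint model over every query-candidate pair.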

Citation

Pages: 1361-1372

Page count: 12

50 items in total

[1] Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment
Zhuang, Jiamin
Yu, Jing
Ding, Yang
Qu, Xiangyan
Hu, Yue
arXiv, 2023,
[2] ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Messina, Nicola
Stefanini, Matteo
Cornia, Marcella
Baraldi, Lorenzo
Falchi, Fabrizio
Amato, Giuseppe
Cucchiara, Rita
19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 64 - 70
[3] Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval
Li, Jiangtong
Liu, Liu
Niu, Li
Zhang, Liqing
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 (30) : 9193 - 9207
[4] Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation
Lei, Sen
Xiao, Xinyu
Zhang, Tianlin
Li, Heng-Chao
Shi, Zhenwei
Zhu, Qing
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
[5] Fine-Grained Image-Text Retrieval via Discriminative Latent Space Learning
Zheng, Min
Wang, Wen
Li, Qingyong
IEEE SIGNAL PROCESSING LETTERS, 2021, 28 (28) : 643 - 647
[6] Fine-Grained Object Classification via Self-Supervised Pose Alignment
Yang, Xuhui
Wang, Yaowei
Chen, Ke
Xu, Yong
Tian, Yonghong
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 7389 - 7398
[7] Fine-grained Feature Assisted Cross-modal Image-text Retrieval
Bu, Chaofei
Liu, Xueliang
Huang, Zhen
Su, Yuling
Tu, Junfeng
Hong, Richang
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 306 - 320
[8] Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting
Chen, Wenting
Wang, Pengyu
Ren, Hui
Sun, Lichao
Li, Quanzheng
Yuan, Yixuan
Li, Xiang
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 240 - 250
[9] Fine-Grained Self-Supervised Learning with Jigsaw puzzles for medical image classification
Park W.
Ryu J.
Comput. Biol. Med., 2024,
[10] Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval
Zhou, Zihui
Feng, Yong
Qiu, Agen
Duan, Guofan
Zhou, Mingliang
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 19194 - 19210
