Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment

Cited by: 7
Authors
Zhuang, Jiamin [1, 2]
Yu, Jing [1, 2]
Ding, Yang [1, 2]
Qu, Xiangyan [1, 2]
Hu, Yue [1, 2]
Affiliations
[1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
[2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Semantics; Image coding; Training; Encoding; Computational modeling; Costs; Fast image-text retrieval; concept-level cross-modal alignment; context-level cross-modal alignment; self-supervised learning;
DOI
10.1109/TMM.2023.3280734
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Image-text retrieval requires a system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, and both still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose SelfAlign, an image-text alignment module built on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency, without extra supervision. SelfAlign contains two collaborative sub-modules that enforce image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training, and it keeps the image and text encoders independent during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pre-training independent-embedding models by 9.1%, 4.2%, and 6.6% in R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models, with an orders-of-magnitude decrease in retrieval time. The source code is available at https://github.com/Zjamie813/SelfAlign.
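Illustrative note (not part of the published abstract): the R@sum metric quoted above is commonly defined as the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions. The sketch below is a minimal, hypothetical illustration of how such a score could be computed in an independent-embedding setup, where images and texts are encoded separately and compared only by cosine similarity. It assumes one matching caption per image at the same index, which is a simplification of the benchmark protocol; all function names are illustrative and are not taken from the SelfAlign code.

```python
import math

def cosine_similarity_matrix(image_embs, text_embs):
    """Cosine similarity between independently encoded image and text vectors."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def recall_at_k(sim, k):
    """Percent of queries (rows) whose match (same index, by assumption) is in the top k."""
    hits = 0
    for q, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)

def r_sum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over image-to-text (rows) and text-to-image (columns)."""
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return sum(recall_at_k(sim, k) + recall_at_k(sim_t, k) for k in ks)
```

Because the two encoders never interact, the embeddings of a gallery can be computed once and reused for every query, which is the efficiency argument the abstract makes for the independent-embedding framework.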
Pages: 1361-1372
Page count: 12