Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment

Cited by: 7

Authors:

Zhuang, Jiamin ^{[1,2]}

Yu, Jing ^{[1,2]}

Ding, Yang ^{[1,2]}

Qu, Xiangyan ^{[1,2]}

Hu, Yue ^{[1,2]}

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100085, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 101408, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Semantics; Image coding; Training; Encoding; Computational modeling; Costs; Fast image-text retrieval; concept-level cross-modal alignment; context-level cross-modal alignment; self-supervised learning;

DOI:

10.1109/TMM.2023.3280734

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, both of which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that enforce image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training and keeps the image and text encoders independent during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pre-training independent-embedding models by 9.1%, 4.2%, and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
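To make the R@sum metric and the independent-embedding setting mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not taken from the SelfAlign code base. It assumes the common convention that R@K is the percentage of queries with at least one correct match among the top K candidates, and that R@sum adds R@K over both retrieval directions (typically K = 1, 5, 10). The function names, the toy 2-D embeddings, and the choice of K = 1, 2 for the toy run are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(queries, candidates):
    """Cosine similarity of every query embedding against every candidate.

    The two sides are embedded independently, so retrieval needs only
    dot products between precomputed vectors (no cross-modal interaction).
    """
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    return [[sum(a * b for a, b in zip(qv, cv)) for cv in c] for qv in q]

def recall_at_k(sim, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k."""
    hits = 0
    for row, rel in zip(sim, relevant):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if rel & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)

def r_sum(image_emb, text_emb, texts_of_image, images_of_text, ks=(1, 5, 10)):
    """Sum of R@K over image-to-text and text-to-image retrieval."""
    i2t = similarity_matrix(image_emb, text_emb)
    t2i = similarity_matrix(text_emb, image_emb)
    return (sum(recall_at_k(i2t, texts_of_image, k) for k in ks)
            + sum(recall_at_k(t2i, images_of_text, k) for k in ks))

# Hypothetical toy data: three images and three captions, caption i describes image i.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.9]]
ground_truth = [{0}, {1}, {2}]

# Only K = 1 and K = 2 are meaningful with three candidates.
score = r_sum(images, texts, ground_truth, ground_truth, ks=(1, 2))
print(round(score, 2))
```

In this toy run only the first image and the first caption retrieve their match at rank 1, while every query finds its match within the top 2, so each direction contributes 33.33 + 100 and the total is 266.67. With real data, the embeddings would come from the separately trained image and text encoders.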


Pages: 1361-1372

Page count: 12
