Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Cited by: 3
Authors
Pang, Huaxin [1 ]
Wei, Shikui [1 ]
Zhang, Gangjian [1 ]
Zhang, Shiyin [1 ]
Qiu, Shuang [1 ]
Zhao, Yao [1 ]
Affiliations
[1] Beijing Jiaotong University, Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, People's Republic of China
Funding
National Key Research and Development Program of China;
Keywords
Image retrieval; Semantics; Task analysis; Visualization; Transformers; Feature extraction; Fuses; Composed image retrieval; embedding fusion; multi-modal learning; image retrieval; representation; framework;
DOI
10.1109/TMM.2022.3208742
CLC Number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for the desired images. Compared with general image retrieval, it can model users' search intent more comprehensively and retrieve target images more accurately, which has significant impact in real-world applications such as e-commerce and Internet search. However, because of the heterogeneous semantic gap, jointly understanding and fusing image and text is difficult. In this work, to tackle this problem, we propose an end-to-end framework, MCR, which uses both text and images as retrieval queries. The framework comprises four pivotal modules. Specifically, we introduce a Relative Caption-aware Consistency (RCC) constraint to align text pieces with images in the database, which effectively bridges the heterogeneous gap. Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) are constructed to mine multiple interactions between image local features and text word features and to learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements the weak-text features and is conducive to modeling an augmented semantic space. Extensive experiments demonstrate superior performance over existing state-of-the-art methods on several benchmarks.
Pages: 6446 - 6457
Page count: 12
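
The abstract above only names the MCR modules; as a rough illustration of the general composed-query idea (not the authors' implementation), the sketch below fuses pre-extracted reference-image region features with text word features via cross-attention and ranks a gallery by cosine similarity. All module internals, feature dimensions, and the class name SimpleComposedQuery are assumptions for illustration, hypothetical stand-ins rather than the paper's actual RCC/MCF/CGP/WSA components.

# Illustrative sketch (not the authors' code): a minimal composed-query
# retrieval pipeline, assuming pre-extracted image region and word features.
# Shapes, module internals, and hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleComposedQuery(nn.Module):
    """Fuse reference-image local features with text word features into a
    single query embedding (hypothetical stand-in for the fusion modules)."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Cross-attention: text words attend to image regions.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8,
                                                batch_first=True)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, img_regions, txt_words):
        # img_regions: (B, R, img_dim)  R local image features
        # txt_words:   (B, W, txt_dim)  W word features
        v = self.img_proj(img_regions)          # (B, R, D)
        t = self.txt_proj(txt_words)            # (B, W, D)
        attn_out, _ = self.cross_attn(t, v, v)  # text-guided image context
        # Pool both streams and fuse into one composed-query embedding.
        q = torch.cat([attn_out.mean(dim=1), t.mean(dim=1)], dim=-1)
        return F.normalize(self.fuse(q), dim=-1)  # (B, D), L2-normalized


if __name__ == "__main__":
    model = SimpleComposedQuery()
    query = model(torch.randn(4, 36, 2048), torch.randn(4, 12, 768))
    gallery = F.normalize(torch.randn(100, 512), dim=-1)  # target image embeddings
    scores = query @ gallery.t()                          # cosine similarity
    print(scores.topk(5, dim=-1).indices)                 # indices of top-5 retrieved images

In practice, such a model would be trained with a contrastive or triplet objective that pulls the composed query toward its target-image embedding; the paper's RCC constraint and WSA module address alignment and weak query texts beyond what this sketch shows.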