Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Cited by: 3
Authors
Pang, Huaxin [1 ]
Wei, Shikui [1 ]
Zhang, Gangjian [1 ]
Zhang, Shiyin [1 ]
Qiu, Shuang [1 ]
Zhao, Yao [1 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Image retrieval; Semantics; Task analysis; Visualization; Transformers; Feature extraction; Fuses; Composed image retrieval; embedding fusion; multi-modal learning; image retrieval; REPRESENTATION; FRAMEWORK;
DOI
10.1109/TMM.2022.3208742
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for the desired images. Compared with general image retrieval, it can model users' search intent more comprehensively and retrieve the target images more accurately, which has significant impact in real-world applications such as E-commerce and Internet search. However, because of the heterogeneous semantic gap, joint understanding and fusion of image and text are difficult to achieve. In this work, to tackle this problem, we propose MCR, an end-to-end framework that uses both text and images as retrieval queries. The framework mainly consists of four pivotal modules. Specifically, we introduce the Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which effectively bridges the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) modules mine multiple interactions between image local features and text word features and learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements the weak-text features and helps model an augmented semantic space. Extensive experiments demonstrate superior performance over existing state-of-the-art algorithms on several benchmarks.
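To make the composed-query idea in the abstract concrete, the sketch below shows one minimal way to fuse reference-image region features with modification-text word features into a single query embedding and score a gallery by cosine similarity. The ComposedQueryFusion module, its dimensions, and the gated cross-attention fusion rule are illustrative assumptions for a PyTorch-style implementation; they are not the MCR/RCC/MCF/CGP/WSA design described in the paper.

```python
# Hypothetical sketch of a composed-query fusion step for composed image
# retrieval (CIR). Module names, dimensions, and the fusion rule are
# illustrative assumptions, not the authors' MCR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComposedQueryFusion(nn.Module):
    """Fuse reference-image region features with text word features
    into one composed-query embedding (assumed design)."""

    def __init__(self, img_dim=512, txt_dim=512, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Cross-attention: text words attend to image regions.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8,
                                                batch_first=True)
        # Gate deciding how much of the image vs. text vector to keep.
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                  nn.Sigmoid())

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, R, img_dim) region features of the reference image
        # txt_feats: (B, W, txt_dim) word features of the modification text
        img = self.img_proj(img_feats)
        txt = self.txt_proj(txt_feats)
        attended, _ = self.cross_attn(query=txt, key=img, value=img)
        txt_vec = attended.mean(dim=1)          # pool attended words
        img_vec = img.mean(dim=1)               # pool image regions
        g = self.gate(torch.cat([img_vec, txt_vec], dim=-1))
        composed = g * img_vec + (1.0 - g) * txt_vec
        return F.normalize(composed, dim=-1)


if __name__ == "__main__":
    fusion = ComposedQueryFusion()
    query = fusion(torch.randn(2, 49, 512), torch.randn(2, 12, 512))
    gallery = F.normalize(torch.randn(100, 512), dim=-1)  # candidate image embeddings
    scores = query @ gallery.t()                          # cosine-similarity retrieval
    print(scores.argmax(dim=-1))                          # indices of best-matching images
```

In this toy setup the composed query and the gallery embeddings share one normalized space, so retrieval reduces to a dot product; the paper's alignment constraint (RCC) and pooling (CGP) address the same goal with a more elaborate design.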
Pages: 6446-6457
Page count: 12