Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Cited by: 3
Authors
Pang, Huaxin [1 ]
Wei, Shikui [1 ]
Zhang, Gangjian [1 ]
Zhang, Shiyin [1 ]
Qiu, Shuang [1 ]
Zhao, Yao [1 ]
Affiliations
[1] Beijing Jiaotong University, Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, People's Republic of China
Funding
National Key Research and Development Program of China;
Keywords
Image retrieval; Semantics; Task analysis; Visualization; Transformers; Feature extraction; Fuses; Composed image retrieval; embedding fusion; multi-modal learning; image retrieval; representation; framework;
DOI
10.1109/TMM.2022.3208742
CLC Number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for the desired images. Compared with general image retrieval, it can model users' search intent more comprehensively and retrieve target images more accurately, which has significant impact in real-world applications such as e-commerce and Internet search. However, because of the heterogeneous semantic gap, jointly understanding and fusing image and text is difficult. In this work, to tackle this problem, we propose an end-to-end framework, MCR, which uses both text and images as retrieval queries. The framework comprises four pivotal modules. Specifically, we introduce a Relative Caption-aware Consistency (RCC) constraint to align text pieces with images in the database, which effectively bridges the heterogeneous gap. Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) are constructed to mine multiple interactions between image local features and text word features and to learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements the weak-text features and is conducive to modeling an augmented semantic space. Extensive experiments demonstrate superior performance over existing state-of-the-art methods on several benchmarks.
Pages: 6446 - 6457
Page count: 12
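
The abstract above only names the MCR modules; as a rough illustration of the general composed-query idea (not the authors' implementation), the sketch below fuses pre-extracted reference-image region features with text word features via cross-attention and ranks a gallery by cosine similarity. All module internals, feature dimensions, and the class name SimpleComposedQuery are assumptions for illustration, hypothetical stand-ins rather than the paper's actual RCC/MCF/CGP/WSA components.

# Illustrative sketch (not the authors' code): a minimal composed-query
# retrieval pipeline, assuming pre-extracted image region and word features.
# Shapes, module internals, and hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleComposedQuery(nn.Module):
    """Fuse reference-image local features with text word features into a
    single query embedding (hypothetical stand-in for the fusion modules)."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Cross-attention: text words attend to image regions.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8,
                                                batch_first=True)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, img_regions, txt_words):
        # img_regions: (B, R, img_dim)  R local image features
        # txt_words:   (B, W, txt_dim)  W word features
        v = self.img_proj(img_regions)          # (B, R, D)
        t = self.txt_proj(txt_words)            # (B, W, D)
        attn_out, _ = self.cross_attn(t, v, v)  # text-guided image context
        # Pool both streams and fuse into one composed-query embedding.
        q = torch.cat([attn_out.mean(dim=1), t.mean(dim=1)], dim=-1)
        return F.normalize(self.fuse(q), dim=-1)  # (B, D), L2-normalized


if __name__ == "__main__":
    model = SimpleComposedQuery()
    query = model(torch.randn(4, 36, 2048), torch.randn(4, 12, 768))
    gallery = F.normalize(torch.randn(100, 512), dim=-1)  # target image embeddings
    scores = query @ gallery.t()                          # cosine similarity
    print(scores.topk(5, dim=-1).indices)                 # indices of top-5 retrieved images

In practice, such a model would be trained with a contrastive or triplet objective that pulls the composed query toward its target-image embedding; the paper's RCC constraint and WSA module address alignment and weak query texts beyond what this sketch shows.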