Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Cited by: 3
Authors
Pang, Huaxin [1 ]
Wei, Shikui [1 ]
Zhang, Gangjian [1 ]
Zhang, Shiyin [1 ]
Qiu, Shuang [1 ]
Zhao, Yao [1 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Image retrieval; Semantics; Task analysis; Visualization; Transformers; Feature extraction; Fuses; Composed image retrieval; embedding fusion; multi-modal learning; image retrieval; REPRESENTATION; FRAMEWORK;
DOI
10.1109/TMM.2022.3208742
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for the desired images. Compared to general image retrieval, it can model the users' search intent more comprehensively and retrieve the target images more accurately, which has significant impact in various real-world applications, such as E-commerce and Internet search. However, because of the heterogeneous semantic gap between modalities, jointly understanding and fusing the image and text is difficult. In this work, to tackle this problem, we propose an end-to-end framework, MCR, which uses both text and images as retrieval queries. The framework consists of four pivotal modules. Specifically, we introduce the Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which effectively bridges the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) modules are constructed to mine multiple interactions between image local features and text word features and learn the complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements the weak-text features and is conducive to modeling an augmented semantic space. Extensive experiments on several benchmarks demonstrate superior performance over existing state-of-the-art methods.
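
As a rough illustration of the composed-query idea described in the abstract, the sketch below fuses local reference-image features with modification-text word features into a single retrieval embedding via cross-attention and pooling. It is a minimal PyTorch sketch under assumed names, shapes, and dimensions (ComposedQueryFusion, 512-d features, mean pooling); it is not the authors' MCR implementation and omits the RCC, CGP, and WSA components.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryFusion(nn.Module):
    """Hypothetical composed-query fusion: text words attend over image regions."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: text word features query the image local features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_local, txt_words):
        # img_local: (B, N_regions, dim) -- local features of the reference image
        # txt_words: (B, N_words,   dim) -- word features of the modification text
        attended, _ = self.cross_attn(query=txt_words, key=img_local, value=img_local)
        # Pool both streams and fuse their complementary information.
        img_vec = img_local.mean(dim=1)
        txt_vec = attended.mean(dim=1)
        fused = self.proj(torch.cat([img_vec, txt_vec], dim=-1))
        return F.normalize(fused, dim=-1)  # unit-norm composed-query embedding

# Usage: rank database images by cosine similarity to the fused query.
if __name__ == "__main__":
    fusion = ComposedQueryFusion()
    img_local = torch.randn(2, 49, 512)   # e.g. a 7x7 grid of CNN region features
    txt_words = torch.randn(2, 12, 512)   # e.g. 12 token embeddings
    query = fusion(img_local, txt_words)              # (2, 512)
    gallery = F.normalize(torch.randn(100, 512), dim=-1)
    scores = query @ gallery.t()                      # cosine similarities
    top5 = scores.topk(5, dim=-1).indices             # indices of top-5 matches
    print(top5.shape)                                 # torch.Size([2, 5])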
Pages: 6446-6457
Number of pages: 12
Related Papers
50 records in total
  • [1] Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval. Zhang, Gangjian; Wei, Shikui; Pang, Huaxin; Zhao, Yao. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 5353-5362.
  • [2] Cross-Modal Joint Prediction and Alignment for Composed Query Image Retrieval. Yang, Yuchen; Wang, Min; Zhou, Wengang; Li, Houqiang. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 3303-3311.
  • [3] Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval. Zhang, Feifei; Xu, Mingliang; Xu, Changsheng. IEEE Transactions on Image Processing, 2022, 31: 1000-1011.
  • [4] Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval. Xu, Yahui; Bin, Yi; Wei, Jiwei; Yang, Yang; Wang, Guoqing; Shen, Heng Tao. IEEE Transactions on Multimedia, 2023, 25: 8346-8357.
  • [5] Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment. Zhang, Gangjian; Wei, Shikui; Pang, Huaxin; Qiu, Shuang; Zhao, Yao. IEEE Transactions on Image Processing, 2022, 31: 5976-5988.
  • [6] Fusion-Based Correlation Learning Model for Cross-Modal Remote Sensing Image Retrieval. Lv, Yafei; Xiong, Wei; Zhang, Xiaohan; Cui, Yaqi. IEEE Geoscience and Remote Sensing Letters, 2022, 19.
  • [7] Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval. Tang, Xu; Wang, Yijing; Ma, Jingjing; Zhang, Xiangrong; Liu, Fang; Jiao, Licheng. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61.
  • [8] Deep Label Feature Fusion Hashing for Cross-Modal Retrieval. Ren, Dongxiao; Xu, Weihua; Wang, Zhonghua; Sun, Qinxiu. IEEE Access, 2022, 10: 100276-100285.
  • [9] A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. Cheng, Qimin; Zhou, Yuzhuo; Fu, Peng; Xu, Yuan; Zhang, Liang. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 4284-4297.
  • [10] Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval. Zhang, Shun; Li, Yupeng; Mei, Shaohui. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61.