Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer

Cited by: 9
Authors
Yang, Qu [1]
Ye, Mang [1]
Cai, Zhaohui [1]
Su, Kehua [1]
Du, Bo [1]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Hubei Luojia Lab, Wuhan 430072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross Relation; Image Retrieval; Transformer;
DOI
10.1109/TIP.2023.3299791
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Composing Text and Image to Image Retrieval (CTI-IR) aims at finding the target image that matches the query image visually and the query text semantically. However, existing works ignore the fact that the reference text usually serves multiple functions, e.g., modification and auxiliary. To address this issue, we put forth a unified solution, namely Hierarchical Aggregation Transformer incorporated with Cross Relation Network (CRN). CRN unifies the modification and relevance manners in a single framework. This configuration shows broader applicability, enabling us to model modification text, auxiliary text, or their combination in triplet relationships simultaneously. Specifically, CRN includes: 1) a Cross Relation Network that comprehensively captures the relationships of the various composed retrieval scenarios caused by the two query text types, allowing a unified retrieval model to designate adaptive combination strategies for flexible applicability; and 2) a Hierarchical Aggregation Transformer that aggregates top-down features with a Multi-layer Perceptron (MLP) to overcome the edge information loss in a window-based multi-stage Transformer. Extensive experiments demonstrate the superiority of the proposed CRN on all three fashion-domain datasets. Code is available at github.com/yan9qu/crn.
Pages: 4543-4554
Page count: 12