Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer

Cited by: 9

Authors:

Yang, Qu [1]

Ye, Mang [1]

Cai, Zhaohui [1]

Su, Kehua [1]

Du, Bo [1]

Affiliations:

[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Hubei Luojia Lab, Wuhan 430072, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2023 / Vol. 32

Funding:

National Natural Science Foundation of China;

Keywords:

Cross Relation; Image Retrieval; Transformer;

DOI:

10.1109/TIP.2023.3299791

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Composing Text and Image to Image Retrieval (CTI-IR) aims to find the target image that matches the query image visually and the query text semantically. However, existing works ignore the fact that the reference text usually serves multiple functions, e.g., modification and auxiliary. To address this issue, we put forth a unified solution, namely Hierarchical Aggregation Transformer incorporated with Cross Relation Network (CRN). CRN unifies the modification and relevance manners in a single framework. This configuration shows broader applicability, enabling us to model modification text, auxiliary text, or their combination in triplet relationships simultaneously. Specifically, CRN includes: 1) a Cross Relation Network that comprehensively captures the relationships of the various composed retrieval scenarios caused by the two different query text types, allowing a unified retrieval model to designate adaptive combination strategies for flexible applicability; 2) a Hierarchical Aggregation Transformer that aggregates top-down features with a Multi-layer Perceptron (MLP) to overcome the edge information loss in a window-based multi-stage Transformer. Extensive experiments demonstrate the superiority of the proposed CRN on three fashion-domain datasets. Code is available at github.com/yan9qu/crn.

Pages: 4543 - 4554

Page count: 12

