Multi-Grained Attention Network With Mutual Exclusion for Composed Query-Based Image Retrieval

Cited by: 14
Authors
Li, Shenshen [1 ,2 ]
Xu, Xing [1 ,2 ]
Jiang, Xun [1 ,2 ]
Shen, Fumin [1 ,2 ]
Liu, Xin [3 ]
Shen, Heng Tao [4 ,5 ,6 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Multimedia, Chengdu 610051, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610051, Peoples R China
[3] Huaqiao Univ, Sch Comp, Xiamen 361021, Peoples R China
[4] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China
[5] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[6] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Image retrieval; Task analysis; Feature extraction; Visualization; Fuses; Electronic mail; Composed query-based image retrieval; multi-grained semantic construction; mutual exclusion; preserved and modified attentions;
DOI
10.1109/TCSVT.2023.3306738
CLC Number
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Code
0808 ; 0809 ;
Abstract
The Composed Query-Based Image Retrieval (CQBIR) task aims to precisely identify the preserved and modified parts of a reference image, based on the multi-grained semantics learned from the composed query. Since the composed query comprises a reference image and a modification text rather than a single modality, this task is more challenging than general image retrieval. Most previous methods learn the preserved and modified parts via separate attention modules and fuse them into a unified representation. However, these methods have two intrinsic drawbacks: 1) they neglect the different granularities of semantic information in the composed query, so the learned preserved and modified parts may not align with the correct semantics; and 2) the preserved and modified parts they learn overlap substantially, which can lead the model to sub-optimal preserved and modified regions. To this end, we propose a novel method, termed Multi-Grained Attention Network with Mutual Exclusion (MANME), to address these problems. MANME consists of two main components: 1) a multi-grained semantic construction module that obtains textual and visual semantic information at several granularities; and 2) an attention module with a mutual exclusion constraint that reduces the overlap between the preserved and modified parts. Together they adequately exploit the multi-grained semantic information and effectively refine the learned preserved and modified parts. Extensive experiments and further analyses on three widely used CQBIR datasets demonstrate that MANME achieves new state-of-the-art performance on the CQBIR task.
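The abstract gives no equations for the mutual exclusion constraint, but one natural reading is an overlap penalty between the two attention distributions: where both the preserved and modified attentions put mass on the same region, their element-wise product is large, so minimizing its sum pushes the two maps apart. The sketch below is a minimal toy illustration of that idea under this assumption; the function and variable names are hypothetical, not taken from the paper.

```python
import math

def softmax(scores):
    """Normalize raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def overlap_penalty(attn_preserved, attn_modified):
    """Hypothetical mutual-exclusion loss: the element-wise product is
    large only where both attentions focus on the same region, so a
    smaller sum means the two attention maps are more disjoint."""
    return sum(p * m for p, m in zip(attn_preserved, attn_modified))

# Toy example: attention over 4 image regions.
preserved = softmax([2.0, 0.1, 0.1, 0.1])    # focuses on region 0
modified = softmax([0.1, 0.1, 0.1, 2.0])     # focuses on region 3
overlapping = softmax([2.0, 0.1, 0.1, 0.1])  # also focuses on region 0

# Disjoint attentions incur a smaller penalty than overlapping ones.
assert overlap_penalty(preserved, modified) < overlap_penalty(preserved, overlapping)
```

In training, such a penalty would be added to the retrieval loss so the preserved and modified attention modules are discouraged from attending to the same regions; the paper's actual formulation may differ.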
Pages: 2959-2972
Page count: 14