SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

Cited by: 28

Authors:

Jandial, Surgan ^{[1,5]}

Badjatiya, Pinkesh ^{[1,2]}

Chawla, Pranit ^{[3]}

Chopra, Ayush ^{[1,4]}

Sarkar, Mausoom ^{[1]}

Krishnamurthy, Balaji ^{[1]}

Affiliations:

[1] Adobe, Media & Data Sci Res Lab, San Jose, CA 95110 USA

[2] Microsoft, Hyderabad, India

[3] Indian Inst Technol, Kharagpur, W Bengal, India

[4] MIT, Media Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA

[5] Adobe, San Jose, CA USA

Source:

2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022) | 2022

DOI:

10.1109/WACV51458.2022.00067

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The ability to search for images efficiently is essential for improving the user experience across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval, which uses supporting text feedback alongside a reference image to retrieve images that simultaneously satisfy the constraints imposed by both inputs. The task is challenging because it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from the text feedback and then applying them to the visual features. To address this, we propose a novel framework, SAC, which tackles the problem in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques. We present extensive quantitative and qualitative analyses, along with ablation studies, showing that SAC outperforms existing techniques and achieves state-of-the-art performance on three benchmark datasets (FashionIQ, Shoes, and Birds-to-Words) while supporting natural language feedback of varying lengths.

Pages: 597-606

Page count: 10

