SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

Cited by: 27
Authors
Jandial, Surgan [1,5]
Badjatiya, Pinkesh [1,2]
Chawla, Pranit [3]
Chopra, Ayush [1,4]
Sarkar, Mausoom [1]
Krishnamurthy, Balaji [1]
Affiliations
[1] Adobe, Media & Data Sci Res Lab, San Jose, CA 95110 USA
[2] Microsoft, Hyderabad, India
[3] Indian Inst Technol, Kharagpur, W Bengal, India
[4] MIT, Media Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[5] Adobe, San Jose, CA USA
Source
2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022) | 2022
DOI
10.1109/WACV51458.2022.00067
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ability to efficiently search for images is essential for improving the user experience across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval, which uses supporting text feedback alongside a reference image to retrieve images that simultaneously satisfy the constraints imposed by both inputs. The task is challenging because it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from the text feedback and then applying them to the visual features. To address this, we propose a novel framework, SAC, which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques. We present extensive quantitative and qualitative analyses, along with ablation studies, to show that SAC outperforms existing techniques, achieving state-of-the-art performance on three benchmark datasets (FashionIQ, Shoes, and Birds-to-Words) while supporting natural language feedback of varying lengths.
Pages: 597-606
Number of pages: 10