Using Text to Teach Image Retrieval

Times cited: 3

Authors:

Dong, Haoyu [1]

Wang, Ze [2]

Qiu, Qiang [2]

Sapiro, Guillermo [1]

Affiliations:

[1] Duke Univ, Durham, NC 27706 USA

[2] Purdue Univ, W Lafayette, IN 47907 USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 | 2021


DOI:

10.1109/CVPRW53098.2021.00180

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image retrieval relies heavily on the quality of the data modeling and of the distance measure in the feature space. Building on the concept of an image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are then defined by the geodesic distance between images, which are represented as graph vertices, or manifold samples. When only a limited number of images is available, this manifold is sparsely sampled, making the geodesic computation, and the corresponding retrieval, harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between visual data and text data. The experimental results show that the joint embedding manifold is a robust representation, making it a better basis for image retrieval given only an image and a textual instruction describing the desired modifications to the image.
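The graph-based retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that image and text features are already vectors in one shared embedding space (the paper's geometric alignment step is not modeled), approximates geodesic distance by shortest paths on a symmetric k-nearest-neighbour graph (an Isomap-style choice made here for illustration), and uses hypothetical function names (`knn_graph`, `geodesic_distances`, `retrieve`) and toy 2-D feature vectors.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as {vertex: {neighbour: edge_length}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (euclidean(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            # Add the edge in both directions so the graph is symmetric.
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source`.

    Path length on the neighbourhood graph stands in for manifold geodesic
    distance (an assumption of this sketch). Unreachable vertices are omitted.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def retrieve(image_feats, text_feats, query_idx, k=2, top=3):
    """Rank images by graph geodesic distance from image `query_idx`.

    Text vectors (hypothetically assumed to be aligned to the image space
    already) are added as extra vertices to densify the graph, but only
    image indices are returned.
    """
    points = list(image_feats) + list(text_feats)
    dist = geodesic_distances(knn_graph(points, k), query_idx)
    n_images = len(image_feats)
    ranked = sorted(
        (d, i) for i, d in dist.items() if i < n_images and i != query_idx
    )
    return [i for _, i in ranked[:top]]


if __name__ == "__main__":
    # Two tight clusters of toy image features with a gap between them.
    images = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
              (20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
    # Hypothetical text embeddings lying in the gap.
    texts = [(8.0, 0.0), (14.0, 0.0)]
    print(retrieve(images, [], 0))     # [1, 2]: far cluster is unreachable
    print(retrieve(images, texts, 0))  # [1, 2, 3]: text vertices bridge the gap
```

With images alone, every vertex's two nearest neighbours stay inside its own cluster, so the sparsely sampled graph is disconnected and the far cluster has no finite geodesic distance from the query. Adding the two text vertices in the gap connects the clusters, which mirrors, in a very simplified form, the abstract's claim that text samples densify a sparsely sampled manifold.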

Pages: 1643-1652

Page count: 10
