Zero-Shot Cross Modal Retrieval Method Based on Deep Supervised Learning

Cited by: 0
Authors
Zeng S. [1 ]
Pang S. [1 ]
Hao W. [1 ]
Affiliations
[1] Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an
Source
Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University | 2022 / Vol. 56 / No. 11
Keywords
attention; cross-modal retrieval; matching; zero-shot
DOI
10.7652/xjtuxb202211016
CLC Classification Number
TB18 [Ergonomics]; Q98 [Anthropology];
Subject Classification Number
030303 ; 1201 ;
Abstract
A novel zero-shot cross-modal retrieval method based on deep supervised learning is proposed, since category matching and correspondence matching are not jointly considered in existing research. Firstly, three types of image-text pairs are distinguished: pairs from the same category that correspond to each other, pairs from the same category that do not correspond, and pairs from different categories. Secondly, to match images and texts at the category level and then further achieve correspondence matching between them, two matching constraints are constructed based on different masking patterns. One masks samples of the other modality that belong to the same category but do not correspond, thereby constraining the matching relations between images and texts of different categories; the other masks samples of the other modality that belong to different categories, thereby constraining the correspondence matching relations between images and texts of the same category. Finally, by aligning the distribution structures of visual features and their corresponding semantic features in each space, the category matching and correspondence matching relations between images and texts are constrained once more. In addition, to enhance the representation of text semantics, an attention mechanism is utilized to extract more salient sentence-level semantic features from word sequences. Experimental results show that on the CUB dataset, the proposed method improves image-to-text retrieval and text-to-image retrieval by 5.9% and 2.2%, respectively, over the baseline model; on the FLO dataset, it outperforms the current best-performing methods by 4.2% and 1.7%, respectively. © 2022 Xi'an Jiaotong University. All rights reserved.
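The two masking-based constraints described in the abstract can be read as contrastive losses computed over a masked image-text similarity matrix. The sketch below is a minimal illustration of that reading, not the authors' released implementation; the function name, the temperature parameter, and the InfoNCE-style cross-entropy formulation are assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F


def masked_matching_losses(img_emb, txt_emb, labels, temperature=0.1):
    """Sketch of the two masking-based matching constraints (assumed form).

    img_emb, txt_emb: (N, d) L2-normalised embeddings; row i of each tensor
    forms a correspondingly matched image-text pair. labels: (N,) category
    labels shared by each pair. The temperature value and the contrastive
    formulation are assumptions, not taken from the paper.
    """
    n = labels.size(0)
    sim = img_emb @ txt_emb.t() / temperature                 # (N, N) image-to-text similarities
    same_cat = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-category indicator matrix
    diag = torch.eye(n, dtype=torch.bool, device=sim.device)  # correspondingly matched positions
    target = torch.arange(n, device=sim.device)               # image i matches text i

    # Constraint 1: mask same-category but non-corresponding samples of the
    # other modality, so each matched pair is contrasted only against samples
    # from different categories (category-level matching).
    logits_cat = sim.masked_fill(same_cat & ~diag, float('-inf'))
    loss_cat = F.cross_entropy(logits_cat, target)

    # Constraint 2: mask different-category samples of the other modality, so
    # each matched pair is contrasted only against same-category distractors
    # (correspondence-level matching within a category).
    logits_corr = sim.masked_fill(~same_cat, float('-inf'))
    loss_corr = F.cross_entropy(logits_corr, target)

    return loss_cat + loss_corr
```

A symmetric text-to-image term could be added by applying the same masks to `sim.t()`; the distribution-structure alignment and the attention-based sentence encoding mentioned in the abstract are omitted from this sketch.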
Pages: 156-166
Number of pages: 10