Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching

Times Cited: 0
Authors
Wen, Keyu [1 ]
Li, Linyang [2 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Sch Informat Sci & Technol, Dept Elect Engn, Shanghai 200438, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200438, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT I | 2021 / Vol. 12891
Funding
National Natural Science Foundation of China;
关键词
Image-text matching; Separate encoding; Cross modal;
DOI
10.1007/978-3-030-86362-3_33
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
There is a surge of interest in cross-modal representation learning, concerning mainly images and texts. Image-text matching is a major challenge among cross-modal tasks. Traditional methods use separate paths to encode the features of each modality and project them into a shared latent space. Recently, the development of pre-trained models has inspired work that learns cross-modal features jointly and boosts performance through large-scale data. However, traditional methods are less effective when both modalities use pre-trained uni-modal encoders, while methods that encode features jointly face an unacceptable computational cost during inference and are therefore less suitable for real-time applications. In this paper, we first explore the pros and cons of these methods, then propose an enhanced separate encoding framework that uses an extra encoding process to project multi-layer features of pre-trained encoders into a similar latent space. Experiments show that our framework outperforms current methods that do not use large-scale image-text pairs on both the Flickr30K and MS-COCO datasets, while maintaining minimal cost during inference.
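As a reading aid for the abstract above, the following is a minimal PyTorch sketch of the separate-encoding idea with multi-layer feature alignment. It is not the authors' implementation: the module name MultiLayerProjector, the mean fusion of layer embeddings, and all dimensions are illustrative assumptions. It shows why separate encoding keeps inference cheap: each modality is encoded independently, per-layer features are projected into a shared space, gallery embeddings can be precomputed offline, and matching reduces to a single matrix product instead of one joint forward pass per image-text pair.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerProjector(nn.Module):
    """Project pooled features from several encoder layers into one shared space."""
    def __init__(self, num_layers: int, feat_dim: int, embed_dim: int):
        super().__init__()
        # One linear projection per encoder layer (hypothetical design choice).
        self.proj = nn.ModuleList(
            [nn.Linear(feat_dim, embed_dim) for _ in range(num_layers)]
        )

    def forward(self, layer_feats):
        # layer_feats: list of (batch, feat_dim) pooled features, one per layer.
        aligned = [p(f) for p, f in zip(self.proj, layer_feats)]
        fused = torch.stack(aligned, dim=0).mean(dim=0)  # fuse aligned layers
        return F.normalize(fused, dim=-1)  # unit norm, so dot product = cosine

# Toy usage: 4 encoder layers, 768-d features, 256-d shared space (all assumed).
img_proj = MultiLayerProjector(num_layers=4, feat_dim=768, embed_dim=256)
txt_proj = MultiLayerProjector(num_layers=4, feat_dim=768, embed_dim=256)

img_feats = [torch.randn(8, 768) for _ in range(4)]  # stand-ins for image encoder layers
txt_feats = [torch.randn(8, 768) for _ in range(4)]  # stand-ins for text encoder layers

img_emb = img_proj(img_feats)  # (8, 256); in retrieval, precomputable offline
txt_emb = txt_proj(txt_feats)  # (8, 256)
sim = img_emb @ txt_emb.t()    # (8, 8) image-text similarity matrix

Under these assumptions, retrieval over N images and M captions costs one N x M matrix multiplication on cached embeddings, whereas a joint encoder would need N x M full forward passes.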
Pages: 403-414 (12 pages)