Transformer-Enhanced Visual-Semantic Representation for Text-Image Retrieval

Citations: 0
Authors
Zhang, Meng [1]
Wu, Wei [1]
Zhang, Haotian [1]
Affiliation
[1] Inner Mongolia Univ, Hohhot 010021, Peoples R China
Source
2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC | 2022
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; Transformer; Graph Structure and Deep learning;
DOI
10.1109/CCDC55256.2022.10033832
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Text-image retrieval aims to find semantically relevant results in one modality given a query, either an image or a sentence, from the other. The key challenge is that the data representations of different modalities are inconsistent, which makes it difficult to measure similarity directly. Existing attention-based methods generally consider only the pairwise association between text and image and ignore the semantic associations among different regions within an image, so the learned image and text features are insufficient at the abstract semantic level. In this paper, we propose a Transformer-Enhanced visual-semantic Representation Model (TERM) for text-image retrieval. Its Transformer-Enhanced (TE) module mines the context relationships between local regions in images and between words in sentences, and can therefore provide more fine-grained clues for image-text matching. In addition, a graph structure is introduced to model high-level semantics effectively, enabling accurate measurement of the similarity between image and text. Experiments on the Flickr30K and MSCOCO datasets demonstrate the effectiveness of our model, which achieves state-of-the-art results for text-image retrieval. On Flickr30K, our model improves Recall@1 by 3.8% for image-to-text retrieval and by 4.6% for text-to-image retrieval.
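The abstract reports results as Recall@1. As a rough illustration of how such a metric is commonly computed, the following is a minimal sketch in plain Python, not the authors' evaluation code. The function name `recall_at_k`, the score matrix layout (one row of candidate scores per query), and the toy values are all assumptions made for this example.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int = 1) -> float:
    """Fraction of queries with at least one relevant candidate in the top k.

    similarity[i][j] is an assumed score between query i and candidate j;
    relevant[i] is the set of candidate indices that match query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity must contain at least one query")
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # Count the query as a hit if any relevant candidate is in the top k.
        if gold.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy data (hypothetical): 3 queries scored against 3 candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gold = [{0}, {1}, {1}]

r1 = recall_at_k(sim, gold, k=1)
r2 = recall_at_k(sim, gold, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy case, the second query ranks its matching candidate second, so it counts as a miss at k = 1 and as a hit at k = 2. For image-to-text retrieval the queries would be images and the candidates sentences; for text-to-image retrieval the roles are swapped.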
Pages: 2042-2048
Number of pages: 7