Fusion layer attention for image-text matching

Cited by: 7
Authors
Wang, Depeng [1]
Wang, Liejun [2]
Song, Shiji [3]
Huang, Gao [3]
Guo, Yuchen [3]
Cheng, Shuli [2]
Ao, Naixiang [4]
Du, Anyu [2]
Affiliations
[1] Xinjiang Univ, Software Coll, Urumqi 830046, Peoples R China
[2] Xinjiang Univ, Informat Sci & Engn Coll, Urumqi 830046, Peoples R China
[3] Tsinghua Univ, Beijing 100084, Peoples R China
[4] China Acad Elect & Informat Technol, Xinjiang Lianhai INA INT Informat Technol Ltd, Urumqi 830000, Peoples R China
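Funding
National Science Foundation (United States)
Keywords
Deep learning; Image-text matching; Multimodal; Retrieval
DOI
10.1016/j.neucom.2021.01.124
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract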
Image-text matching aims to find the relationship between image and text data and to establish a connection between them. The main challenge of image-text matching is the fact that images and texts have different data distributions and feature representations. Current methods for image-text matching fall into two basic types: methods that map image and text data into a common space and then use distance measurements and methods that treat image-text matching as a classification problem. In both cases, the two data modes used are image and text data. In our method, we create a fusion layer to extract intermediate modes, thus improving the image-text processing results. We also propose a concise way to update the loss function that makes it easier for neural networks to handle difficult problems. The proposed method was verified on the Flickr30K and MS-COCO datasets and achieved superior matching results compared to existing methods. (c) 2021 Elsevier B.V. All rights reserved.
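The first family of methods named in the abstract (map both modalities into a common space, then compare by distance) can be illustrated with a minimal sketch. This is not the paper's fusion layer or its loss update, neither of which is specified in this record; the embeddings, the function names, and the choice of cosine similarity as the distance measure are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def rank_texts(image_emb, text_embs):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings, assumed to be already projected into a shared
# 3-dimensional space by some image encoder and text encoder.
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.2, 0.0],  # caption pointing in nearly the same direction as the image
    [0.0, 0.0, 1.0],  # unrelated caption
]
ranking = rank_texts(image_emb, text_embs)
```

In this toy setup the second caption is ranked first because its embedding is closest in direction to the image embedding. A real system would learn the two encoders so that matching pairs end up close together in the shared space.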
Pages: 249-259
Page count: 11