Fusion layer attention for image-text matching

Cited by: 7

Authors:

Wang, Depeng [1]

Wang, Liejun [2]

Song, Shiji [3]

Huang, Gao [3]

Guo, Yuchen [3]

Cheng, Shuli [2]

Ao, Naixiang [4]

Du, Anyu [2]

Affiliations:

[1] Xinjiang Univ, Software Coll, Urumqi 830046, Peoples R China

[2] Xinjiang Univ, Informat Sci & Engn Coll, Urumqi 830046, Peoples R China

[3] Tsinghua Univ, Beijing 100084, Peoples R China

[4] China Acad Elect & Informat Technol, Xinjiang Lianhai INA INT Informat Technol Ltd, Urumqi 830000, Peoples R China

Source:

NEUROCOMPUTING | 2021 / Vol. 442

Funding:

U.S. National Science Foundation;

Keywords:

Deep learning; Image-text matching; Multimodal; Retrieval;

DOI:

10.1016/j.neucom.2021.01.124

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405;

Abstract:

Image-text matching aims to find the relationship between image and text data and to establish a connection between them. The main challenge of image-text matching is the fact that images and texts have different data distributions and feature representations. Current methods for image-text matching fall into two basic types: methods that map image and text data into a common space and then use distance measurements, and methods that treat image-text matching as a classification problem. In both cases, the two data modes used are image and text data. In our method, we create a fusion layer to extract intermediate modes, thus improving the image-text processing results. We also propose a concise way to update the loss function that makes it easier for neural networks to handle difficult problems. The proposed method was verified on the Flickr30K and MS-COCO datasets and achieved superior matching results compared to existing methods. (c) 2021 Elsevier B.V. All rights reserved.

Citation

Pages: 249-259

Page count: 11

26 entries in total

[1] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J]. Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 6077-6086

[2] Kernel-Based Mixture Mapping for Image and Text Association [J]. Du, Youtian; Wang, Xue; Cui, Yunbo; Wang, Hang; Su, Chang. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22(2): 365-379

[3] Faghri F., ARXIV PREPRINT ARXIV

[4] Deep Residual Learning for Image Recognition [J]. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 770-778

[5] Learning Semantic Concepts and Order for Image and Sentence Matching [J]. Huang, Yan; Wu, Qi; Song, Chunfeng; Wang, Liang. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 6163-6171

[6] Instance-aware Image and Sentence Matching with Selective Multimodal LSTM [J]. Huang, Yan; Wang, Wei; Wang, Liang. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 7254-7262

[7] Revisiting Visual Question Answering Baselines [J]. Jabri, Allan; Joulin, Armand; van der Maaten, Laurens. COMPUTER VISION - ECCV 2016, PT VIII, 2016, 9912: 727-739

[8] Karpathy A, 2015, PROC CVPR IEEE, P3128, DOI 10.1109/CVPR.2015.7298932

[9] Stacked Cross Attention for Image-Text Matching [J]. Lee, Kuang-Huei; Chen, Xi; Hua, Gang; Hu, Houdong; He, Xiaodong. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208: 212-228

[10] Visual Semantic Reasoning for Image-Text Matching [J]. Li, Kunpeng; Zhang, Yulun; Li, Kai; Li, Yuanyuan; Fu, Yun. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 4653-4661
