Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1 ]
Zhang, Xiaoming [2 ]
Zhao, Zhonghua [3 ]
Li, Zhoujun [4 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Image-text matching with deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the two modalities usually have complicated underlying relations, and modeling these relations naively may lead to suboptimal performance. In this paper, we develop a novel approach, the bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual-object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to encode the W2R relation, we adopt an LSTM with a bilinear attention function to infer the image regions most related to each particular word, which we refer to as the W2R attention network. On the other side, the O2W attention network is proposed to discover the semantically closest words for each visual object in the image, i.e., the visual O2W relation. A deep model that unifies the two directional attention networks in a holistic learning framework is then proposed to learn matching scores for image-text pairs. Compared to existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
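To make the bi-directional attention idea in the abstract concrete, here is a minimal PyTorch sketch, not the authors' released code: every word attends over image regions via a bilinear score (W2R), every region attends over words (O2W), and the two directional similarities are fused into one matching score. All dimensions, module names, and the averaging-based fusion are illustrative assumptions.
```python
# Minimal sketch of bi-directional (W2R / O2W) bilinear attention for image-text matching.
# Hyperparameters, projections, and score fusion are assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Scores N candidate vectors against a query with a bilinear form, then pools them."""
    def __init__(self, query_dim, cand_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(query_dim, cand_dim) * 0.01)

    def forward(self, query, cands):
        # query: (B, Dq), cands: (B, N, Dc)
        logits = torch.einsum('bq,qc,bnc->bn', query, self.W, cands)   # bilinear scores (B, N)
        weights = F.softmax(logits, dim=-1)                            # attention over candidates
        attended = torch.bmm(weights.unsqueeze(1), cands).squeeze(1)   # weighted sum (B, Dc)
        return attended, weights

class BiDirectionalMatcher(nn.Module):
    """W2R: each word attends over regions; O2W: each region attends over words (assumed fusion)."""
    def __init__(self, word_dim=300, hid_dim=512, region_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hid_dim, batch_first=True)   # contextualizes the words
        self.w2r_att = BilinearAttention(hid_dim, region_dim)
        self.o2w_att = BilinearAttention(region_dim, hid_dim)
        self.region_proj = nn.Linear(region_dim, hid_dim)          # maps regions into word space

    def forward(self, word_embs, regions):
        # word_embs: (B, T, word_dim), regions: (B, R, region_dim)
        word_states, _ = self.lstm(word_embs)                      # (B, T, hid_dim)
        # W2R direction: pool the regions each word attends to, compare to that word.
        w2r_sims = []
        for t in range(word_states.size(1)):
            att_regions, _ = self.w2r_att(word_states[:, t], regions)
            w2r_sims.append(F.cosine_similarity(
                word_states[:, t], self.region_proj(att_regions), dim=-1))
        # O2W direction: pool the words each region attends to, compare to that region.
        o2w_sims = []
        for r in range(regions.size(1)):
            att_words, _ = self.o2w_att(regions[:, r], word_states)
            o2w_sims.append(F.cosine_similarity(
                self.region_proj(regions[:, r]), att_words, dim=-1))
        # Fuse both directions into one image-text matching score per pair (simple average).
        return 0.5 * (torch.stack(w2r_sims, 1).mean(1) + torch.stack(o2w_sims, 1).mean(1))

# Usage sketch: scores = BiDirectionalMatcher()(word_embs, region_feats)  -> (B,) matching scores,
# which could then be trained with a ranking loss over matching and non-matching pairs.
```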
Pages: 2008-2020
Number of pages: 13
Related Papers
50 records in total
  • [41] Diao, Haiwen; Zhang, Ying; Liu, Wei; Ruan, Xiang; Lu, Huchuan. Plug-and-Play Regulators for Image-Text Matching. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32: 2322-2334
  • [42] Zhao Yumeng; Yun Jing; Gao Shuo; Liu Limin. News Image-Text Matching With News Knowledge Graph. IEEE ACCESS, 2021, 9: 108017-108027
  • [43] Wei, Hao; Wang, Shuhui; Han, Xinzhe; Xue, Zhe; Ma, Bin; Wei, Xiaoming; Wei, Xiaolin. Synthesizing Counterfactual Samples for Effective Image-Text Matching. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4355-4364
  • [44] Zhao, Guoshuai; Zhang, Chaofeng; Shang, Heng; Wang, Yaxiong; Zhu, Li; Qian, Xueming. Generative label fused network for image-text matching. KNOWLEDGE-BASED SYSTEMS, 2023, 263
  • [45] Huang, Feiran; Wei, Kaimin; Weng, Jian; Li, Zhoujun. Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (03)
  • [46] Gao, Xiaoru; Tao, Rong; Zheng, Guoyan. BIDMIR: BI-DIRECTIONAL MEDICAL IMAGE REGISTRATION WITH SYMMETRIC ATTENTION AND CYCLIC CONSISTENCY REGULARIZATION. 2022 IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (IEEE ISBI 2022), 2022
  • [47] Dai, Xin; Tuerhong, Gulanbaier; Wushouer, Mairidan. Globally Guided Confidence Enhancement Network for Image-Text Matching. APPLIED SCIENCES-BASEL, 2023, 13 (09)
  • [48] Malali, Noam; Keller, Yosi. Learning to Embed Semantic Similarity for Joint Image-Text Retrieval. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12): 10252-10260
  • [49] Xiong, Guoxin; Meng, Meng; Zhang, Tianzhu; Zhang, Dongming; Zhang, Yongdong. Reference-Aware Adaptive Network for Image-Text Matching. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10): 9678-9691
  • [50] Zheng, Ke; Li, Zhou. An Image-Text Matching Method for Multi-Modal Robots. JOURNAL OF ORGANIZATIONAL AND END USER COMPUTING, 2024, 36 (01)