Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1]
Zhang, Xiaoming [2]
Zhao, Zhonghua [3]
Li, Zhoujun [4]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Image-text matching with deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the underlying relations between them are usually complicated, and modeling these relations too simply may lead to suboptimal performance. In this paper, we develop a novel approach, the bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to encode the W2R relation effectively, we adopt an LSTM with a bilinear attention function to infer the image regions that are most related to particular words; we refer to this as the W2R attention network. In the other direction, the O2W attention network is proposed to discover the semantically close words for each visual object in the image, i.e., the visual O2W relation. Then, a deep model that unifies the two directional attention networks into a holistic learning framework is proposed to learn the matching scores of image-text pairs. Compared with existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
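The abstract does not give the exact form of the bilinear attention function, so the following is only a minimal sketch of generic bilinear attention from one word to a set of image regions, not the authors' implementation. The names `bilinear_score`, `word_to_region_attention`, the matrix `M`, and the toy vectors are all assumptions made for illustration; the LSTM that would produce the word vector is omitted.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_score(word, M, region):
    # Generic bilinear form word^T M region (M is a hypothetical learned matrix).
    projected = [sum(M[i][j] * region[j] for j in range(len(region)))
                 for i in range(len(word))]
    return sum(w * p for w, p in zip(word, projected))

def word_to_region_attention(word, M, regions):
    # Score every region against the word, normalize the scores into
    # attention weights, and return the weighted sum of region features.
    scores = [bilinear_score(word, M, r) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(a * r[k] for a, r in zip(weights, regions))
               for k in range(dim)]
    return weights, context

# Toy data (assumed): a 2-d word vector, a 2x3 matrix, three 3-d region features.
word = [1.0, 0.0]
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
regions = [[2.0, 0.0, 0.0],
           [0.0, 2.0, 0.0],
           [0.0, 0.0, 2.0]]

weights, context = word_to_region_attention(word, M, regions)
print(weights)
print(context)
```

In this toy setup the first region receives the largest weight because it is the only one with a nonzero bilinear score against the word. An O2W direction could be sketched the same way by swapping the roles of regions and words, again as an assumption rather than the paper's stated design.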
Pages: 2008 - 2020
Number of pages: 13
Related papers
50 items in total
  • [31] Asymmetric Polysemous Reasoning for Image-Text Matching
    Zhang, Hongping
    Yang, Ming
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1013 - 1022
  • [32] Electricity demand error corrections with attention bi-directional neural networks
    Ghimire, Sujan
    Deo, Ravinesh C.
    Casillas-Perez, David
    Salcedo-Sanz, Sancho
    ENERGY, 2024, 291
  • [33] Learning Two-Branch Neural Networks for Image-Text Matching Tasks
    Wang, Liwei
    Li, Yin
    Huang, Jing
    Lazebnik, Svetlana
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (02) : 394 - 407
  • [34] Giving Text More Imagination Space for Image-text Matching
    Dong, Xinfeng
    Han, Longfei
    Zhang, Dingwen
    Liu, Li
    Han, Junwei
    Zhang, Huaxiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6359 - 6368
  • [35] Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image-Text Matching
    Liu, Xin
    He, Yi
    Cheung, Yiu-Ming
    Xu, Xing
    Wang, Nannan
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02) : 948 - 961
  • [36] Towards Deconfounded Image-Text Matching with Causal Inference
    Li, Wenhui
    Su, Xinqi
    Song, Dan
    Wang, Lanjun
    Zhang, Kun
    Liu, An-An
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6264 - 6273
  • [37] Generating counterfactual negative samples for image-text matching
    Su, Xinqi
    Song, Dan
    Li, Wenhui
    Ren, Tongwei
    Liu, An-An
    INFORMATION PROCESSING & MANAGEMENT, 2025, 62 (03)
  • [38] Deep Cross-Modal Projection Learning for Image-Text Matching
    Zhang, Ying
    Lu, Huchuan
    COMPUTER VISION - ECCV 2018, PT I, 2018, 11205 : 707 - 723
  • [39] FA-IATI: A Framework of Frequency Adaptive and Iterative Attention Interaction for Image-Text Matching
    Qin, Youxuan
    Zhao, Jing
    Li, Ming
    Sun, Chao
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [40] A NEIGHBOR-AWARE APPROACH FOR IMAGE-TEXT MATCHING
    Liu, Chunxiao
    Mao, Zhendong
    Zang, Wenyu
    Wang, Bin
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3970 - 3974