Graph Structured Network for Image-Text Matching

Cited by: 194
Authors
Liu, Chunxiao [1, 2]
Mao, Zhendong [3]
Zhang, Tianzhu [3]
Xie, Hongtao [3]
Wang, Bin [4]
Zhang, Yongdong [3]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Univ Sci & Technol China, Hefei, Peoples R China
[4] Xiaomi AI Lab, Beijing, Peoples R China
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR42600.2020.01093
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image-text matching has received growing interest because it bridges vision and language. The key challenge is learning the correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics but fail to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. GSMN explicitly models objects, relations and attributes as a structured phrase, which not only allows the correspondence of objects, relations and attributes to be learned separately, but also helps in learning the fine-grained correspondence of the structured phrase. This is achieved by node-level matching and structure-level matching. Node-level matching associates each node with its relevant nodes from the other modality, where a node can be an object, a relation or an attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations in structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN
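The two matching stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the softmax `scale` value, the mean-based neighbourhood fusion, and the 3-dimensional embeddings are all assumptions chosen only to make the idea concrete.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs, scale=10.0):
    """Softmax with a hypothetical sharpening factor `scale`."""
    m = max(xs)
    exps = [math.exp(scale * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def node_level_matching(query_nodes, other_nodes):
    """Associate each query node with the other modality's nodes via
    attention, then score it against the attended (aggregated) node."""
    dim = len(other_nodes[0])
    scores = []
    for q in query_nodes:
        weights = softmax([cosine(q, o) for o in other_nodes])
        attended = [sum(w * o[d] for w, o in zip(weights, other_nodes))
                    for d in range(dim)]
        scores.append(cosine(q, attended))
    return scores

def structure_level_matching(node_scores, adjacency):
    """Fuse each node's score with its neighbours' scores (plain mean,
    an assumed stand-in for learned propagation), then average over nodes."""
    fused = []
    for i, s in enumerate(node_scores):
        group = [s] + [node_scores[j] for j in adjacency.get(i, [])]
        fused.append(sum(group) / len(group))
    return sum(fused) / len(fused)

# Toy textual graph: node 0 = object, node 1 = attribute, node 2 = relation.
text_nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
adjacency = {0: [1, 2], 1: [0], 2: [0]}

# Hypothetical region embeddings for a matching and a mismatched image.
matching_image = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
mismatched_image = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]

match_score = structure_level_matching(
    node_level_matching(text_nodes, matching_image), adjacency)
mismatch_score = structure_level_matching(
    node_level_matching(text_nodes, mismatched_image), adjacency)
print(match_score > mismatch_score)
```

In this sketch the mismatched image contains the object but not the attribute or relation, so only the object node scores highly at node level; fusing over the phrase structure lowers the overall score relative to the image where every node finds a counterpart.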
Pages: 10918-10927
Page count: 10