Graph Structured Network for Image-Text Matching

Cited by: 194

Authors:

Liu, Chunxiao [1,2]

Mao, Zhendong [3]

Zhang, Tianzhu [3]

Xie, Hongtao [3]

Wang, Bin [4]

Zhang, Yongdong [3]

Affiliations:

[1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

[2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

[3] University of Science and Technology of China, Hefei, China

[4] Xiaomi AI Lab, Beijing, China

Source:

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020) | 2020

Funding:

National Natural Science Foundation of China

DOI:

10.1109/CVPR42600.2020.01093

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image-text matching has received growing interest because it bridges vision and language. The key challenge is how to learn the correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, but fail to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. GSMN explicitly models objects, relations, and attributes as a structured phrase, which not only allows the correspondence of objects, relations, and attributes to be learned separately, but also helps in learning the fine-grained correspondence of the structured phrase as a whole. This is achieved by node-level matching and structure-level matching. Node-level matching associates each node with its relevant nodes from the other modality, where a node can be an object, a relation, or an attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations in structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN
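The two matching stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax `temperature`, the element-wise product used as a matching vector, and the row-normalized adjacency used for neighborhood fusion are all assumptions chosen for illustration. Node-level matching here lets each text node attend over image nodes; structure-level matching then averages each node's matching vector with those of its graph neighbors before scoring.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def node_level_matching(text_nodes, image_nodes, temperature=10.0):
    # Hypothetical node-level step: each text node attends over all image
    # nodes by cosine similarity, then is compared with what it attended to.
    t = l2_normalize(text_nodes)
    v = l2_normalize(image_nodes)
    sim = t @ v.T                               # (n_text, n_image)
    attn = softmax(temperature * sim, axis=1)   # attention over image nodes
    attended = attn @ image_nodes               # (n_text, d)
    # Element-wise product of unit vectors; its sum is a cosine similarity.
    return t * l2_normalize(attended)

def structure_level_matching(match_vectors, adjacency):
    # Hypothetical structure-level step: fuse each node's matching vector
    # with those of its neighbors, then reduce to one similarity score.
    adj = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    adj = adj / adj.sum(axis=1, keepdims=True)     # row-normalize
    fused = adj @ match_vectors                    # (n_text, d)
    return float(fused.sum(axis=1).mean())

# Toy example: three text nodes (e.g. object, relation, attribute) in a chain.
text = np.eye(3, 6)
image_matched = np.eye(3, 6)        # same directions as the text nodes
image_mismatched = np.eye(6)[3:]    # orthogonal to every text node
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

score_matched = structure_level_matching(
    node_level_matching(text, image_matched), adjacency)
score_mismatched = structure_level_matching(
    node_level_matching(text, image_mismatched), adjacency)
print(score_matched > score_mismatched)
```

In this toy setup the matched image scores higher than the mismatched one, which is the behavior a fine-grained matching score is meant to have; the actual GSMN architecture and its learned parameters are described in the paper and the linked repository.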

Pages: 10918-10927

Page count: 10
