IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Cited by: 249
Authors
Chen, Hui [1 ]
Ding, Guiguang [1 ]
Liu, Xudong [2 ]
Lin, Zijia [3 ]
Liu, Ji [4 ]
Han, Jungong [5 ]
Affiliations
[1] Tsinghua Univ, Sch Software, BNRist, Beijing, Peoples R China
[2] Kwai Ads Platform, Beijing, Peoples R China
[3] Microsoft Res, Redmond, WA USA
[4] Kwai Seattle AI Lab, Kwai FeDA Lab, Kwai AI Platform, Seattle, WA USA
[5] Univ Warwick, WMG Data Sci, Coventry, W Midlands, England
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR42600.2020.01267
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e., involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable language. It may be difficult for existing methods to optimally capture such sophisticated correspondences. In this paper, to address this deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignment. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e., Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named KWAI-AD, further validate the applicability of our method in practical scenarios.
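The iterative scheme described in the abstract — repeated cross-attention alignment between word and region features, with a gated memory update carrying refined alignment knowledge into the next step — can be sketched as follows. This is a minimal NumPy illustration under my own assumptions (random gate weights, cosine-based step similarity, `steps=3`), not the authors' implementation; names like `memory_update` and `iterative_match` are hypothetical.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def attend(query, context, temperature=9.0):
    # Cross-attention: each query vector attends over all context vectors
    # with a softmax over cosine similarities.
    sim = l2norm(query) @ l2norm(context).T            # (n_query, n_context)
    w = np.exp(temperature * sim)
    w /= w.sum(axis=1, keepdims=True)
    return w @ context                                  # attended context, (n_query, d)

def memory_update(query, attended, Wg):
    # Gated "memory distillation" sketch: fuse the attended context into the
    # query so the next matching step starts from a refined alignment.
    g = 1.0 / (1.0 + np.exp(-np.concatenate([query, attended], axis=1) @ Wg))
    return g * query + (1.0 - g) * attended

def iterative_match(text, image, steps=3, rng=None):
    # text: (n_words, d) word features; image: (n_regions, d) region features.
    rng = rng or np.random.default_rng(0)
    d = text.shape[1]
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # illustrative gate weights
    q, score = text, 0.0
    for _ in range(steps):
        ctx = attend(q, image)
        # Step similarity: mean cosine between each word and its attended region mix;
        # the final matching score accumulates over all alignment steps.
        score += float(np.mean(np.sum(l2norm(q) * l2norm(ctx), axis=1)))
        q = memory_update(q, ctx, Wg)
    return score
```

In this sketch, later steps re-attend from queries already fused with context, which is the sense in which alignment is explored "progressively" rather than in a single uniform pass.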
Pages: 12652-12660
Page count: 9