IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Cited by: 249
Authors
Chen, Hui [1 ]
Ding, Guiguang [1 ]
Liu, Xudong [2 ]
Lin, Zijia [3 ]
Liu, Ji [4 ]
Han, Jungong [5 ]
Affiliations
[1] Tsinghua Univ, Sch Software, BNRist, Beijing, Peoples R China
[2] Kwai Ads Platform, Beijing, Peoples R China
[3] Microsoft Res, Redmond, WA USA
[4] Kwai Seattle AI Lab, Kwai FeDA Lab, Kwai AI Platform, Seattle, WA USA
[5] Univ Warwick, WMG Data Sci, Coventry, W Midlands, England
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR42600.2020.01267
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e., involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable language. It may be difficult for existing methods to optimally capture such sophisticated correspondences. In this paper, to address this deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignment. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e., Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named KWAI-AD, further validate the applicability of our method in practical scenarios.
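The iterative scheme described in the abstract — repeated cross-attention alignment between word and region features, with a gated memory update carrying refined alignment knowledge into the next step — can be sketched as follows. This is a minimal NumPy illustration under my own assumptions (random gate weights, cosine-based step similarity, `steps=3`), not the authors' implementation; names like `memory_update` and `iterative_match` are hypothetical.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def attend(query, context, temperature=9.0):
    # Cross-attention: each query vector attends over all context vectors
    # with a softmax over cosine similarities.
    sim = l2norm(query) @ l2norm(context).T            # (n_query, n_context)
    w = np.exp(temperature * sim)
    w /= w.sum(axis=1, keepdims=True)
    return w @ context                                  # attended context, (n_query, d)

def memory_update(query, attended, Wg):
    # Gated "memory distillation" sketch: fuse the attended context into the
    # query so the next matching step starts from a refined alignment.
    g = 1.0 / (1.0 + np.exp(-np.concatenate([query, attended], axis=1) @ Wg))
    return g * query + (1.0 - g) * attended

def iterative_match(text, image, steps=3, rng=None):
    # text: (n_words, d) word features; image: (n_regions, d) region features.
    rng = rng or np.random.default_rng(0)
    d = text.shape[1]
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # illustrative gate weights
    q, score = text, 0.0
    for _ in range(steps):
        ctx = attend(q, image)
        # Step similarity: mean cosine between each word and its attended region mix;
        # the final matching score accumulates over all alignment steps.
        score += float(np.mean(np.sum(l2norm(q) * l2norm(ctx), axis=1)))
        q = memory_update(q, ctx, Wg)
    return score
```

In this sketch, later steps re-attend from queries already fused with context, which is the sense in which alignment is explored "progressively" rather than in a single uniform pass.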
Pages: 12652-12660
Page count: 9