Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Cited by: 74
Authors
Messina, Nicola [1 ]
Amato, Giuseppe [1 ]
Esuli, Andrea [1 ]
Falchi, Fabrizio [1 ]
Gennaro, Claudio [1 ]
Marchand-Maillet, Stephane [2 ]
Affiliations
[1] ISTI CNR, Pisa, Italy
[2] Univ Geneva, VIPER Grp, Geneva, Switzerland
Funding
European Union Horizon 2020;
Keywords
Deep learning; cross-modal retrieval; multi-modal matching; computer vision; natural language processing; LANGUAGE; GENOME;
DOI
10.1145/3451390
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links would prevent the separate extraction of the visual and textual features needed for the online search and offline indexing steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the Recall@1 metric for the image and sentence retrieval tasks, respectively.
Pages: 23
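As a rough illustration of the late-fusion design described in the abstract, the sketch below (assumed PyTorch code, not the authors' implementation) scores an image-sentence pair purely from pairwise region-word similarities, so that region and word features can still be produced by fully separate pipelines. The max-over-regions / mean-over-words pooling and the feature dimension are assumptions made only for this example.

```python
# Illustrative sketch of a late-fusion, fine-grained alignment score.
# Pooling scheme and feature sizes are assumptions, not the paper's definition.
import torch
import torch.nn.functional as F

def global_alignment_score(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """regions: (n_regions, d) visual features from the image pipeline.
    words:   (n_words, d) textual features from the sentence pipeline.
    Returns a scalar image-sentence similarity built only from
    region-word similarities (no cross-attention between the pipelines)."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()            # (n_words, n_regions) cosine similarities
    per_word = sim.max(dim=1).values     # best-matching region for each word
    return per_word.mean()               # pool word scores into a global score

# Example usage with random features (hypothetical sizes: 36 regions, 12 words, d=1024).
score = global_alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))
```

Because nothing crosses the two modalities before this final scoring step, region features could be indexed offline and compared against the word features of a query at search time, which is the scalability argument made in the abstract.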