Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Cited by: 88
Authors
Messina, Nicola [1 ]
Amato, Giuseppe [1 ]
Esuli, Andrea [1 ]
Falchi, Fabrizio [1 ]
Gennaro, Claudio [1 ]
Marchand-Maillet, Stephane [2 ]
Affiliations
[1] ISTI CNR, Pisa, Italy
[2] Univ Geneva, VIPER Grp, Geneva, Switzerland
Funding
European Union Horizon 2020
Keywords
Deep learning; cross-modal retrieval; multi-modal matching; computer vision; natural language processing; ATTENTION; GENOME;
DOI
10.1145/3451390
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links between the two pipelines would make it impossible to extract visual and textual features separately, as required by the offline indexing and online search steps of large-scale retrieval systems. TERAN therefore merges information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods: on the MS-COCO 1K test set, we obtain Recall@1 improvements of 5.7% on image retrieval and 3.5% on sentence retrieval.
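The scalability argument above hinges on a global image-sentence score that is assembled from region-word similarities only at the very end. As a minimal sketch of such late, fine-grained pooling, the PyTorch snippet below scores one image-sentence pair; the specific pooling choice (max over regions, mean over words), the function name, and the feature shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_score(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Global image-sentence score built from region-word alignments.

    regions: (n_regions, d) features from the visual pipeline.
    words:   (n_words, d)   features from the textual pipeline.
    The two pipelines stay separate; their outputs meet only here,
    in the final alignment phase.
    """
    # Cosine similarity between every region and every word.
    sim = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()
    # For each word, keep its best-matching region, then average over
    # words to obtain one scalar similarity for the pair.
    return sim.max(dim=0).values.mean()

# Region and word features would come from an offline visual index and an
# online text encoder, respectively; random tensors stand in for them here.
image_regions = torch.randn(36, 1024)   # e.g., 36 detected region features
sentence_words = torch.randn(12, 1024)  # e.g., 12 word-token features
print(alignment_score(image_regions, sentence_words))
```

Because the image side of this score depends only on precomputed region features, they can be indexed offline once; at query time only the text encoding, the matrix product, and the pooling need to run online.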
Pages: 23