Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Cited by: 74
Authors
Messina, Nicola [1 ]
Amato, Giuseppe [1 ]
Esuli, Andrea [1 ]
Falchi, Fabrizio [1 ]
Gennaro, Claudio [1 ]
Marchand-Maillet, Stephane [2 ]
Affiliations
[1] ISTI CNR, Pisa, Italy
[2] Univ Geneva, VIPER Grp, Geneva, Switzerland
Funding
European Union Horizon 2020;
Keywords
Deep learning; cross-modal retrieval; multi-modal matching; computer vision; natural language processing; LANGUAGE; GENOME;
DOI
10.1145/3451390
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links would prevent the separate extraction of the visual and textual features needed for the online search and offline indexing steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the Recall@1 metric for the image and sentence retrieval tasks, respectively.
Pages: 23
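As a rough illustration of the late-fusion design described in the abstract, the sketch below (assumed PyTorch code, not the authors' implementation) scores an image-sentence pair purely from pairwise region-word similarities, so that region and word features can still be produced by fully separate pipelines. The max-over-regions / mean-over-words pooling and the feature dimension are assumptions made only for this example.

```python
# Illustrative sketch of a late-fusion, fine-grained alignment score.
# Pooling scheme and feature sizes are assumptions, not the paper's definition.
import torch
import torch.nn.functional as F

def global_alignment_score(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """regions: (n_regions, d) visual features from the image pipeline.
    words:   (n_words, d) textual features from the sentence pipeline.
    Returns a scalar image-sentence similarity built only from
    region-word similarities (no cross-attention between the pipelines)."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()            # (n_words, n_regions) cosine similarities
    per_word = sim.max(dim=1).values     # best-matching region for each word
    return per_word.mean()               # pool word scores into a global score

# Example usage with random features (hypothetical sizes: 36 regions, 12 words, d=1024).
score = global_alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))
```

Because nothing crosses the two modalities before this final scoring step, region features could be indexed offline and compared against the word features of a query at search time, which is the scalability argument made in the abstract.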