Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Cited by: 36

Authors:

Li, Wenhui [1]

Wang, Yan [1]

Su, Yuting

Li, Xuanya [4]

Liu, An-An [1,2,3]

Zhang, Yongdong [5]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230000, Peoples R China

[3] Chinese Acad Sci, Key Lab Electromagnet Space Informat, Beijing 100000, Peoples R China

[4] Baidu Inc, Beijing 100000, Peoples R China

[5] Univ Sci & Technol China, Hefei 230027, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2023 / Vol. 25

Keywords:

Semantics; Visualization; Dogs; Mouth; Task analysis; Feature extraction; Bridges; Bi-directional aggregations; image and sentence matching; multi-scale alignments; NETWORK;

DOI:

10.1109/TMM.2021.3128744

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Image and sentence matching is a critical task for bridging the visual and textual discrepancy that arises from heterogeneous modalities. Great progress has been made by exploring coarse-grained relationships between images and sentences or fine-grained relationships between regions and words. However, fully mining and exploiting the correspondences between these two modalities remains challenging. In this work, we propose a novel Multi-scale Fine-grained Alignments Network (MFA), which effectively explores multi-scale visual-textual correspondences to help bridge the cross-modal discrepancy. Specifically, a word-scale matching module is first used to mine the basic but fundamental correspondences between a single word and an independent region. Then, we propose a phrase-scale matching module to explore the relations between attribute-constrained objects and their corresponding regions, which preserves more associated information. To cope with the complex interactions among multiple phrases and images, we design a relation-scale matching module to capture high-order semantics between the two modalities. Moreover, each matching module includes visual aggregation and textual aggregation, which ensures the bi-directional coupling of multi-scale semantics. Extensive qualitative and quantitative experiments on two challenging datasets, Flickr30K and MSCOCO, show that the proposed method achieves superior performance compared with existing methods.
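The abstract does not specify how the matching modules are implemented. As a rough illustration of what a bi-directional region-word aggregation can look like, the sketch below uses a generic cross-attention scheme over cosine similarities. This is a stand-in technique, not the MFA network itself: the function name `bidirectional_match`, the `temperature` parameter, and the averaging of the two directions are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_match(regions, words, temperature=10.0):
    """Hypothetical image-sentence score from region and word features.

    regions: (R, D) array of region features.
    words:   (W, D) array of word features.
    Returns a scalar in [-1, 1]; higher means a better match.
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    sim = r @ w.T  # (R, W) cosine similarities

    # Direction 1: each region attends over all words.
    attn_r2w = softmax(temperature * sim, axis=1)       # (R, W)
    words_for_region = l2_normalize(attn_r2w @ w)       # (R, D)
    score_regions = np.sum(r * words_for_region, axis=1).mean()

    # Direction 2: each word attends over all regions.
    attn_w2r = softmax(temperature * sim.T, axis=1)     # (W, R)
    regions_for_word = l2_normalize(attn_w2r @ r)       # (W, D)
    score_words = np.sum(w * regions_for_word, axis=1).mean()

    # Assumption: combine the two directions by simple averaging.
    return 0.5 * (score_regions + score_words)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_regions = rng.normal(size=(36, 128))   # toy region features
    sentence_words = rng.normal(size=(12, 128))  # toy word features
    print(bidirectional_match(image_regions, sentence_words))
```

Scoring in both directions means a sentence is penalized both when a region has no supporting word and when a word has no supporting region, which is the general motivation for coupling the two aggregations.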

Pages: 543-556

Page count: 14

References: 65 in total

