Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Cited by: 36

Authors:

Li, Wenhui [1]

Wang, Yan [1]

Su, Yuting

Li, Xuanya [4]

Liu, An-An [1,2,3]

Zhang, Yongdong [5]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230000, Peoples R China

[3] Chinese Acad Sci, Key Lab Electromagnet Space Informat, Beijing 100000, Peoples R China

[4] Baidu Inc, Beijing 100000, Peoples R China

[5] Univ Sci & Technol China, Hefei 230027, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2023 / Vol. 25

Keywords:

Semantics; Visualization; Dogs; Mouth; Task analysis; Feature extraction; Bridges; Bi-directional aggregations; image and sentence matching; multi-scale alignments; NETWORK;

DOI:

10.1109/TMM.2021.3128744

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Image and sentence matching is a critical task for bridging the discrepancy between the visual and textual modalities, which arises from their heterogeneous nature. Great progress has been made by exploring coarse-grained relationships between images and sentences, or fine-grained relationships between regions and words. However, fully mining and exploiting the correspondences between the two modalities remains challenging. In this work, we propose a novel Multi-scale Fine-grained Alignments Network (MFA), which explores multi-scale visual-textual correspondences to help bridge the cross-modal discrepancy. Specifically, a word-scale matching module is first used to mine the basic correspondences between a single word and an independent region. Then, we propose a phrase-scale matching module to explore the relations between objects, constrained by their attributes, and the corresponding regions, which retains more associated information. To cope with the complex interactions among multiple phrases and images, we design a relation-scale matching module to capture high-order semantics between the two modalities. Moreover, each matching module includes both visual and textual aggregation, which ensures the bi-directional coupling of multi-scale semantics. Extensive qualitative and quantitative experiments on two challenging datasets, Flickr30K and MSCOCO, show that the proposed method achieves superior performance compared with existing methods.
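The word-scale matching and bi-directional aggregation described in the abstract can be illustrated with a minimal sketch. This is not the MFA implementation: it assumes cosine similarity between word and region features and a simple max-then-mean aggregation in each direction, and the names `cosine` and `bidirectional_score` and the toy feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bidirectional_score(words, regions):
    """Mean of a text-to-image and an image-to-text aggregation.

    Assumption for illustration: each direction takes the best match
    per item (max) and then averages over items (mean).
    """
    # Word-by-region similarity matrix.
    sim = [[cosine(w, r) for r in regions] for w in words]
    # Textual direction: each word keeps its best-matching region.
    t2i = sum(max(row) for row in sim) / len(words)
    # Visual direction: each region keeps its best-matching word.
    i2t = sum(
        max(sim[i][j] for i in range(len(words)))
        for j in range(len(regions))
    ) / len(regions)
    return 0.5 * (t2i + i2t)


# Toy 2-D features (hypothetical values, for illustration only).
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = bidirectional_score(words, regions)
```

Aggregating in both directions means a region that no word describes well lowers the score, which a text-to-image aggregation alone would not capture.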


Pages: 543-556

Page count: 14
