Context-Aware Multi-View Summarization Network for Image-Text Matching

Cited by: 106
Authors
Qu, Leigang [1 ]
Liu, Meng [2 ]
Cao, Da [3 ]
Nie, Liqiang [1 ]
Tian, Qi [4 ]
Affiliations
[1] Shandong Univ, Qingdao, Peoples R China
[2] Shandong Jianzhu Univ, Qingdao, Peoples R China
[3] Hunan Univ, Changsha, Peoples R China
[4] Huawei Cloud & AI, Changsha, Peoples R China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Funding
National Natural Science Foundation of China
Keywords
Image-Text Matching; Cross-Modal Retrieval; Multi-View Summarization; Context Modeling; LANGUAGE;
DOI
10.1145/3394171.3413961
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite the significance and value, most prior work is still confronted with a multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network to summarize context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, we are able to adaptively capture context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Ultimately, we devise a multi-view matching scheme to match multi-view image features with corresponding text ones. To justify our work, we have conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, which demonstrate the superiority of our model as compared to several state-of-the-art baselines.
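The visual-side pipeline the abstract describes (gated self-attention over region features, then multi-view summarization with a diversity regularizer) can be sketched in plain numpy. This is a minimal illustrative reading, not the paper's exact equations: the sigmoid gating formula, the per-view softmax pooling, and the ||AᵀA − I||² penalty (in the style of self-attentive embeddings) are all assumptions, as are the toy shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(X, Wq, Wk, Wv, Wg, bg):
    """Self-attention over region (or word) features, with a sigmoid
    gate controlling how much attended context enters each feature."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))       # (n, n) attention weights
    C = A @ V                                         # context-enhanced features
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([X, C], axis=-1) @ Wg + bg)))
    return g * C + (1.0 - g) * X                      # adaptive blend of input and context

def multi_view_summarize(R, Ws):
    """Aggregate n region features into k image-level views via k
    attention heads; each column of A weights the regions for one view."""
    A = softmax(R @ Ws, axis=0)                       # (n, k), columns sum to 1
    return A.T @ R, A                                 # views: (k, d)

def diversity_penalty(A):
    """Illustrative regularizer pushing the k views apart: ||A^T A - I||_F^2."""
    G = A.T @ A
    return float(np.sum((G - np.eye(A.shape[1])) ** 2))

# Toy example: 6 regions, 8-dim features, 3 views (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k = 6, 8, 3
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
Wg, bg = 0.1 * rng.standard_normal((2 * d, d)), np.zeros(d)
Ws = 0.1 * rng.standard_normal((d, k))

regions = gated_self_attention(X, Wq, Wk, Wv, Wg, bg)
views, A = multi_view_summarize(regions, Ws)
print(views.shape)  # (3, 8): k image-level views, each matchable against a caption
```

Each of the k views would then be scored against a text embedding in the multi-view matching step, so that semantically diverse captions can each align with a different view.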
Pages: 1047-1055
Page count: 9