Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval

Cited by: 19

Authors:

Dong, Jianfeng [1,4]

Long, Zhongzi [2]

Mao, Xiaofeng [3]

Lin, Changting [1,5]

He, Yuan [3]

Ji, Shouling [2,4]

Affiliations:

[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China

[2] Zhejiang Univ, Hangzhou, Peoples R China

[3] Alibaba Grp, Hangzhou, Peoples R China

[4] Alibaba Zhejiang Univ Joint Res Inst Frontier Tec, Hangzhou, Peoples R China

[5] Chinese Acad Sci, Inst Informat Engn, State Key Lab Informat Secur, Beijing, Peoples R China

Source:

NEUROCOMPUTING | 2021 / Volume 440

Keywords:

Cross-modal retrieval; Domain adaptation; Cross-dataset training; Adversarial learning; IMAGE;

DOI:

10.1016/j.neucom.2021.01.114

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Cross-modal retrieval is an important but challenging research task in the multimedia community. Most existing works on this task are supervised: they typically train models on a large number of aligned image-text/video-text pairs, under the assumption that training and testing data are drawn from the same distribution. If this assumption does not hold, traditional cross-modal retrieval methods may suffer a performance drop at evaluation time. In this paper, we introduce a new task named domain adaptive cross-modal retrieval, where training (source) data and testing (target) data are from different domains. The task is challenging, as there are not only a semantic gap and a modality gap between visual and textual items, but also a domain gap between source and target domains. Therefore, we propose a Multi-level Alignment Network (MAN) that has two mapping modules to project the visual and textual modalities into a common space, respectively, and uses three alignments to learn more discriminative features in that space. A semantic alignment is used to reduce the semantic gap, while a cross-modality alignment and a cross-domain alignment are employed to alleviate the modality gap and the domain gap, respectively. Extensive experiments in the context of domain-adaptive image-text retrieval and video-text retrieval demonstrate that our proposed model, MAN, consistently outperforms multiple baselines, showing a superior generalization ability for target data. Moreover, MAN establishes a new state-of-the-art for large-scale text-to-video retrieval on the TRECVID 2017 and 2018 Ad-hoc Video Search benchmarks. (c) 2021 Elsevier B.V. All rights reserved.
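Illustrative note: the abstract does not state how the three alignments are formulated. The sketch below is not the paper's implementation. It assumes, purely for illustration, that the semantic alignment is a hinge-based triplet ranking loss over cosine similarities in the common space (a common choice in cross-modal retrieval) and that the three alignment terms are combined as a weighted sum. The function names, the margin value, and the weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_ranking_loss(visual, textual, margin=0.2):
    """Hinge-based ranking loss for matched pairs visual[i] <-> textual[i].

    Assumed form (not taken from the paper): every mismatched pair in the
    batch should score at least `margin` below the matched pair.
    """
    n = len(visual)
    total = 0.0
    for i in range(n):
        pos = cosine(visual[i], textual[i])
        for j in range(n):
            if j == i:
                continue
            # Visual item i against a wrong text j.
            total += max(0.0, margin + cosine(visual[i], textual[j]) - pos)
            # Text i against a wrong visual item j.
            total += max(0.0, margin + cosine(visual[j], textual[i]) - pos)
    return total / n


def total_loss(l_semantic, l_modality, l_domain, w_modality=1.0, w_domain=1.0):
    """Hypothetical weighted sum of the three alignment terms."""
    return l_semantic + w_modality * l_modality + w_domain * l_domain


# Perfectly aligned toy embeddings give zero ranking loss.
aligned = triplet_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 0.0], [0.0, 1.0]])
# Swapping the texts makes every matched pair score below its negatives.
swapped = triplet_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the aligned batch incurs no penalty, while the swapped batch is penalized, which is the behavior a semantic alignment term is meant to encourage. In an adversarial setup, the cross-modality and cross-domain terms would typically come from discriminators, which this sketch leaves as plain numeric inputs.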

Pages: 207-219

Page count: 13
