CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

Cited by: 0
Authors
Wang, Yabing [1,2,3,6]
Wang, Fan [3]
Dong, Jianfeng [1,5]
Luo, Hao [3,4]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
[4] Hupan Lab, Hangzhou, Zhejiang, Peoples R China
[5] Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[6] DAMO Acad, Hangzhou, Peoples R China
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38, NO 6 | 2024
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Cross-lingual cross-modal retrieval, which aims to align vision and a target language (V-T) without using any annotated V-T data pairs, has recently garnered increasing attention. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multilingual and multimodal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, together with the noise in target-language translations, makes it difficult to align their representations effectively. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language via cross-lingual transfer. This approach allows us to fully leverage the merits of multilingual pre-trained models (e.g., mBERT) and the benefit that both languages share the same modality, i.e., a smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our method and its high potential for large-scale retrieval.
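The abstract only outlines the framework, so the following is a minimal, hypothetical PyTorch sketch of the general idea it describes: a cross-lingual network built on mBERT aligns source-language captions with their MT translations, and its in-batch similarity distribution serves as soft supervision for the vision-to-target-language network. The module names, loss weighting (alpha), temperature, and the KL-based form of the knowledge transfer are all assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel


class TextEncoder(nn.Module):
    """Shared multilingual text encoder (mBERT backbone + projection head)."""

    def __init__(self, name="bert-base-multilingual-cased", dim=256):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.backbone.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return F.normalize(self.proj(cls), dim=-1)


def info_nce(sim, tau=0.05):
    """Symmetric InfoNCE loss over an in-batch similarity matrix."""
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim / tau, labels)
                  + F.cross_entropy(sim.t() / tau, labels))


def cl2cm_step(img_feat, src_txt, tgt_txt, text_enc, vis_proj,
               tau=0.05, alpha=1.0):
    """One hypothetical training step on MT-constructed pseudo-parallel triples.

    img_feat: precomputed visual features, shape (B, D_v)
    src_txt / tgt_txt: tokenizer outputs (input_ids, attention_mask) for the
        source-language captions and their machine translations
    """
    s = text_enc(**src_txt)                      # source-language embeddings
    t = text_enc(**tgt_txt)                      # target-language embeddings
    v = F.normalize(vis_proj(img_feat), dim=-1)  # visual embeddings

    # 1) Cross-lingual alignment: text-to-text, so the modality gap is small
    #    and the learned correspondence is comparatively reliable.
    sim_st = s @ t.t()
    loss_cl = info_nce(sim_st, tau)

    # 2) Cross-modal alignment between vision and the (noisy) target language.
    sim_vt = v @ t.t()
    loss_cm = info_nce(sim_vt, tau)

    # 3) Knowledge transfer (assumed here as KL-based distillation): the
    #    cross-lingual similarity distribution acts as a soft target guiding
    #    the cross-modal similarities.
    p_teacher = F.softmax(sim_st.detach() / tau, dim=-1)
    log_p_student = F.log_softmax(sim_vt / tau, dim=-1)
    loss_kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    return loss_cl + loss_cm + alpha * loss_kd
```

The intuition the abstract states is captured in step 3: because both sides of the cross-lingual pair are text, its similarity structure is a cleaner supervision signal than the noisy V-T pairs alone, which is why it can serve as the "knowledge" transferred to the cross-modal network.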
Pages: 5651-5659
Page count: 9