CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

Cited: 0
Authors
Wang, Yabing [1 ,2 ,3 ,6 ]
Wang, Fan [3 ]
Dong, Jianfeng [1 ,5 ]
Luo, Hao [3 ,4 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
[4] Hupan Lab, Hangzhou, Zhejiang, Peoples R China
[5] Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[6] DAMO Acad, Hangzhou, Peoples R China
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6 | 2024
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Cross-lingual cross-modal retrieval has garnered increasing attention recently; it aims to align vision with a target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multilingual, multimodal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target-language translations, poses significant challenges to aligning their representations effectively. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multilingual pre-trained models (e.g., mBERT) and the benefits of same-modality structure, i.e., a smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval.
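The abstract's core idea, using the more reliable cross-lingual (text-to-text) correspondence to guide the noisier cross-modal (vision-to-text) alignment, can be sketched as a contrastive loss plus a knowledge-distillation term. The sketch below is an illustrative reconstruction, not the paper's actual implementation: the function names, the temperature value, and the KL-divergence form of the transfer are all assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE contrastive loss over a batch of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature   # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0))  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def cl2cm_style_loss(img_emb, tgt_text_emb, src_text_emb, temperature=0.05):
    """Cross-modal contrastive loss plus a distillation term that transfers
    the same-modality (hence smaller-gap) source/target text similarity
    distribution onto the vision/target-language similarities."""
    # Standard vision <-> target-language contrastive alignment.
    loss_cm = info_nce(img_emb, tgt_text_emb, temperature)

    # Teacher: cross-lingual text-text similarity distribution (no gradient).
    with torch.no_grad():
        sim_cl = (F.normalize(src_text_emb, dim=-1)
                  @ F.normalize(tgt_text_emb, dim=-1).t())
        p_teacher = F.softmax(sim_cl / temperature, dim=-1)

    # Student: cross-modal similarity distribution.
    sim_cm = (F.normalize(img_emb, dim=-1)
              @ F.normalize(tgt_text_emb, dim=-1).t())
    log_q_student = F.log_softmax(sim_cm / temperature, dim=-1)

    # Distill the cleaner cross-lingual structure into the cross-modal network.
    loss_kd = F.kl_div(log_q_student, p_teacher, reduction="batchmean")
    return loss_cm + loss_kd
```

In practice `img_emb`, `tgt_text_emb`, and `src_text_emb` would come from the vision encoder and a multilingual text encoder (e.g., mBERT) applied to MT-generated pseudo-parallel captions; here they are just placeholder tensors.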
Pages: 5651-5659 (9 pages)