Dual-View Curricular Optimal Transport for Cross-Lingual Cross-Modal Retrieval

Cited by: 9
Authors
Wang, Yabing [1 ,2 ,3 ]
Wang, Shuhui [4 ]
Luo, Hao [5 ,6 ]
Dong, Jianfeng [3 ,7 ]
Wang, Fan [5 ]
Han, Meng [8 ]
Wang, Xun [3 ,7 ]
Wang, Meng [9 ]
Affiliations
[1] Xi An Jiao Tong Univ, Natl Key Lab Human Machine Hybrid Augmented Intell, Natl Engn Res Ctr Visual Informat & Applicat, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710049, Shaanxi, Peoples R China
[3] Zhejiang Gongshang Univ, Coll Comp Sci & Technol, Hangzhou 310035, Peoples R China
[4] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
[5] Alibaba Grp, Hangzhou 310052, Peoples R China
[6] Hupan Lab, Hangzhou 310058, Zhejiang, Peoples R China
[7] Zhejiang Key Lab E Commerce, Zhoushan 311121, Peoples R China
[8] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Peoples R China
[9] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Noise measurement; Estimation; Costs; Transportation; Training; Task analysis; Cross-modal retrieval; noise correspondence learning; cross-lingual transfer; optimal transport; machine translation;
DOI
10.1109/TIP.2024.3365248
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of a large number of English-oriented human-labeled vision-language corpora. To overcome the scarcity of labeled non-English data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the sentences translated by MT are generally imperfect descriptions of the corresponding visual content. Wrongly assuming that the pseudo-parallel data are correctly correlated makes the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. In addition, the proposed method extends well to cross-lingual image-text baselines and generalizes reasonably to out-of-domain data.
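To make the idea of scoring pair confidence with optimal transport more concrete, the following is a minimal sketch and not the authors' DCOT implementation. It assumes a cosine-distance cost between two batches of embeddings, uniform marginals, and standard entropy-regularized Sinkhorn iterations; the function names `sinkhorn_plan` and `pair_confidence`, the regularization strength `epsilon`, and the iteration count are all hypothetical choices made for illustration. The dual-view curriculum described in the abstract is not modeled here.

```python
import numpy as np


def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions.

    cost: (n, m) array of pairwise transport costs.
    Returns an (n, m) plan whose entries sum to 1.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the rows (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on the columns (assumption)
    kernel = np.exp(-cost / epsilon)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def pair_confidence(x, y, epsilon=0.1):
    """Confidence that x[i] and y[i] truly correspond, for each i.

    x, y: (n, d) embeddings of nominally paired samples.
    Returns n scores in [0, 1]: the mass the plan keeps on the diagonal,
    rescaled by n because each column carries mass 1/n.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - x @ y.T  # cosine distance as the transport cost
    plan = sinkhorn_plan(cost, epsilon)
    return np.diag(plan) * x.shape[0]
```

In this sketch, a pair whose embeddings agree keeps most of its mass on the diagonal of the plan and scores close to 1, while a pair whose caption better matches another sample in the batch loses mass to that sample and scores lower. Such scores could then be used to down-weight suspect pairs in a retrieval loss, which is one plausible reading of "quantifying confidence" and not a statement of how the paper does it.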
Pages: 1522-1533
Number of pages: 12