Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation

Cited by: 0
Authors
Weng, Yu [1 ]
He, Wenbin [1 ]
Dong, Jun [1 ]
Chaomurilige, Xuan [1 ]
Liu, Xuan [1 ]
Liu, Zheng [1 ]
Affiliations
[1] Minzu Univ China, Dept Informat Engn, Beijing 100081, Peoples R China
Funding
National Social Science Fund of China;
Keywords
Adaptation models; Multilingual; Visualization; Training; Semantics; Data models; Natural language processing; Translation; Large language models; Knowledge transfer; Cross-lingual vision-language understanding; efficient model adaptation; zero-shot learning;
DOI
10.1109/TMM.2025.3557678
CLC (Chinese Library Classification) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown on the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; and (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to eight languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.
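To make the two components concrete, the sketch below gives one plausible reading of the approach described in the abstract: a lightweight residual adapter that re-projects multilingual text features toward the English-trained representation space, trained with a feature-level distillation loss on translation-parallel inputs. The names (SyntaxAwareAdapter, semantic_distillation_loss), the bottleneck design, the hidden sizes, and the cosine-similarity objective are illustrative assumptions, not details taken from the paper.

# Minimal sketch, assuming a bottleneck adapter and a cosine feature-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxAwareAdapter(nn.Module):
    """Hypothetical bottleneck adapter: re-projects multilingual text features
    toward the English-trained representation space via a residual correction."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 96):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, multilingual_feats: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: keep the original features, add a learned correction.
        return self.norm(multilingual_feats + self.up(F.gelu(self.down(multilingual_feats))))

def semantic_distillation_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """One plausible distillation objective: pull adapted non-English features
    toward the frozen English teacher's features for translation-parallel text."""
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()

# Toy usage: a batch of 4 sequences, 16 tokens each, hidden size 768.
adapter = SyntaxAwareAdapter()
multilingual = torch.randn(4, 16, 768)   # non-English caption features (student side)
english = torch.randn(4, 16, 768)        # English translation features (frozen teacher side)
loss = semantic_distillation_loss(adapter(multilingual), english)
loss.backward()                          # only the adapter parameters receive gradients here

In this reading, the frozen English-trained VLM supplies the teacher features, so only the small adapter is updated, which matches the paper's claim of adapting to new languages cheaply while preserving the original model's multimodal capabilities.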
Pages: 3184-3196
Page count: 13