Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

Cited: 0
Authors
Wang, Shike [1]
Zhang, Wen [2]
Guo, Wenyu [1]
Yu, Dong [1]
Liu, Pengyuan [1,3]
Affiliations
[1] Beijing Language & Culture Univ, Sch Comp Sci, Beijing, Peoples R China
[2] Xiaomi AI Lab, Beijing, Peoples R China
[3] Beijing Language & Culture Univ, Natl Language Resources Monitoring & Res Print Me, Beijing, Peoples R China
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Funding
Beijing Natural Science Foundation;
Keywords
Multimodal; Machine Translation; Contrastive Learning;
DOI
10.1109/IJCNN55064.2022.9892312
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal machine translation (MMT) is a task that incorporates an additional image modality alongside the text to be translated. Prior work has studied the interaction between the two modalities and investigated whether the visual modality is necessary. However, few works focus on giving models better and more effective visual representations as input. We argue that the performance of MMT systems improves when better visual representations are fed into them. To test this hypothesis, we introduce mT-ICL, a multimodal Transformer model with image contrastive learning. The contrastive objective is optimized to enhance the representation ability of the image encoder so that the encoder can generate better and more adaptive visual representations. Experiments show that mT-ICL significantly outperforms a strong baseline and achieves new state-of-the-art results on most test sets of English-to-German and English-to-French. Further analysis reveals that, under the contrastive learning framework, the visual modality contributes more than a mere regularization effect.
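The record does not include the paper's code. As a minimal, hedged sketch of the kind of image contrastive objective the abstract describes, the following NumPy snippet computes an InfoNCE-style loss over a batch of embeddings; the function name, temperature value, and formulation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """Illustrative InfoNCE-style contrastive loss.

    Row i of `positives` is the positive example for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    # L2-normalize embeddings so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature: shape (B, B)
    logits = a @ p.T / temperature
    # Cross-entropy with the diagonal (matched pairs) as the target class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing such a loss pulls matched pairs together and pushes mismatched pairs apart, which is the general mechanism by which a contrastive objective can sharpen an image encoder's representations.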
Pages: 8