Multi-modal graph contrastive encoding for neural machine translation

Cited by: 8
Authors
Yin, Yongjing [1 ]
Zeng, Jiali [2 ]
Su, Jinsong [1 ,5 ]
Zhou, Chulun [1 ]
Meng, Fandong [2 ]
Zhou, Jie [2 ]
Huang, Degen [3 ]
Luo, Jiebo [4 ]
Affiliations
[1] Xiamen Univ, Xiamen, Fujian, Peoples R China
[2] Tencent Inc, Pattern Recognit Ctr, WeChat AI, Beijing 100080, Peoples R China
[3] Dalian Univ Technol, Dalian, Peoples R China
[4] Univ Rochester, Rochester, NY USA
[5] Xiamen Univ, Key Lab Digital Protect & Intelligent Proc Intangi, Minist Culture & Tourism, Xiamen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal neural machine translation; Graph neural networks; Contrastive learning;
DOI
10.1016/j.artint.2023.103986
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
As an important extension of conventional text-only neural machine translation (NMT), multi-modal neural machine translation (MNMT) aims to translate input source sentences paired with images into the target language. Although many MNMT models have been proposed to perform multi-modal semantic fusion, they do not consider fine-grained semantic correspondences between semantic units of different modalities (i.e., words and visual objects), which can be exploited to refine multi-modal representation learning via fine-grained semantic interactions. To address this issue, we propose a graph-based multi-modal fusion encoder for NMT. Concretely, we first employ a unified multi-modal graph to represent the input sentence and image, in which the multi-modal semantic units are considered as the nodes in the graph, connected by two kinds of edges with different semantic relationships. Then, we stack multiple graph-based multi-modal fusion layers that iteratively conduct intra- and inter-modal interactions to learn node representations. Finally, via an attention mechanism, we induce a multi-modal context from the top node representations for the decoder. In particular, we introduce a progressive contrastive learning strategy based on the multi-modal graph to refine the training of our proposed model, where hard negative samples are introduced gradually. To evaluate our model, we conduct experiments on commonly used datasets. Experimental results and analysis show that our MNMT model obtains significant improvements over competitive baselines, achieving state-of-the-art performance on the Multi30K dataset. © 2023 Elsevier B.V. All rights reserved.
Pages: 14
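The progressive contrastive learning strategy described in the abstract trains the model to score a matched sentence–image pair above mismatched ones, introducing harder negatives as training proceeds. The sketch below is a minimal illustration of that general idea, not the authors' implementation: the InfoNCE-style loss, the cosine similarity, the temperature value, and the linear hardness schedule are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    The anchor (e.g., a sentence-graph representation) should score its
    matched positive (e.g., the paired image) above every negative.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos_score = np.exp(cosine(anchor, positive) / temperature)
    neg_scores = sum(np.exp(cosine(anchor, n) / temperature) for n in negatives)
    return float(-np.log(pos_score / (pos_score + neg_scores)))

def pick_negatives(anchor, candidates, step, total_steps, k=1):
    """Progressive hard-negative selection (illustrative schedule): early in
    training prefer easy (dissimilar) negatives, later prefer hard (similar)
    ones, by sliding a window over candidates ranked by similarity."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates, key=lambda c: cosine(anchor, c))
    # Fraction of training completed shifts the sampling window toward
    # the most similar (hardest) candidates.
    start = int((step / total_steps) * max(len(ranked) - k, 0))
    return ranked[start:start + k]
```

A matched pair with an orthogonal negative yields a loss near zero, while swapping positive and negative drives the loss up sharply; the negative picker returns the most dissimilar candidate at step 0 and the most similar one at the final step.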