Multi-modal graph contrastive encoding for neural machine translation

Cited by: 8
Authors
Yin, Yongjing [1 ]
Zeng, Jiali [2 ]
Su, Jinsong [1 ,5 ]
Zhou, Chulun [1 ]
Meng, Fandong [2 ]
Zhou, Jie [2 ]
Huang, Degen [3 ]
Luo, Jiebo [4 ]
Affiliations
[1] Xiamen Univ, Xiamen, Fujian, Peoples R China
[2] Tencent Inc, Pattern Recognit Ctr, WeChat AI, Beijing 100080, Peoples R China
[3] Dalian Univ Technol, Dalian, Peoples R China
[4] Univ Rochester, Rochester, NY USA
[5] Xiamen Univ, Key Lab Digital Protect & Intelligent Proc Intangi, Minist Culture & Tourism, Xiamen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal neural machine translation; Graph neural networks; Contrastive learning;
DOI
10.1016/j.artint.2023.103986
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
As an important extension of conventional text-only neural machine translation (NMT), multi-modal neural machine translation (MNMT) aims to translate input source sentences paired with images into the target language. Although many MNMT models have been proposed to perform multi-modal semantic fusion, they do not consider fine-grained semantic correspondences between semantic units of different modalities (i.e., words and visual objects), which can be exploited to refine multi-modal representation learning via fine-grained semantic interactions. To address this issue, we propose a graph-based multi-modal fusion encoder for NMT. Concretely, we first employ a unified multi-modal graph to represent the input sentence and image, in which the multi-modal semantic units are considered as the nodes in the graph, connected by two kinds of edges with different semantic relationships. Then, we stack multiple graph-based multi-modal fusion layers that iteratively conduct intra- and inter-modal interactions to learn node representations. Finally, via an attention mechanism, we induce a multi-modal context from the top node representations for the decoder. In particular, we introduce a progressive contrastive learning strategy based on the multi-modal graph to refine the training of our proposed model, where hard negative samples are introduced gradually. To evaluate our model, we conduct experiments on commonly used datasets. Experimental results and analysis show that our MNMT model obtains significant improvements over competitive baselines, achieving state-of-the-art performance on the Multi30K dataset. © 2023 Elsevier B.V. All rights reserved.
Pages: 14
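The progressive contrastive learning strategy described in the abstract trains the model to score a matched sentence–image pair above mismatched ones, introducing harder negatives as training proceeds. The sketch below is a minimal illustration of that general idea, not the authors' implementation: the InfoNCE-style loss, the cosine similarity, the temperature value, and the linear hardness schedule are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    The anchor (e.g., a sentence-graph representation) should score its
    matched positive (e.g., the paired image) above every negative.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos_score = np.exp(cosine(anchor, positive) / temperature)
    neg_scores = sum(np.exp(cosine(anchor, n) / temperature) for n in negatives)
    return float(-np.log(pos_score / (pos_score + neg_scores)))

def pick_negatives(anchor, candidates, step, total_steps, k=1):
    """Progressive hard-negative selection (illustrative schedule): early in
    training prefer easy (dissimilar) negatives, later prefer hard (similar)
    ones, by sliding a window over candidates ranked by similarity."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates, key=lambda c: cosine(anchor, c))
    # Fraction of training completed shifts the sampling window toward
    # the most similar (hardest) candidates.
    start = int((step / total_steps) * max(len(ranked) - k, 0))
    return ranked[start:start + k]
```

A matched pair with an orthogonal negative yields a loss near zero, while swapping positive and negative drives the loss up sharply; the negative picker returns the most dissimilar candidate at step 0 and the most similar one at the final step.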