RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation

Cited: 0
Authors
Wang, Yan [1 ]
Zeng, Yawen [2 ]
Liang, Junjie [1 ]
Xing, Xiaofen [1 ]
Xu, Jin [1 ]
Xu, Xiangmin [1 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] ByteDance AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal machine translation; multi-modal prompt learning; multi-modal dictionary;
DOI
10.1145/3652583.3658018
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As an extension of machine translation, multi-modal machine translation aims to make optimal use of visual information. Technically, image information is integrated as an auxiliary modality into multi-modal fusion and alignment through concepts or latent semantics, typically within a Transformer-based framework. However, current approaches often focus on a single modality when designing numerous handcrafted features (e.g., visual concept extraction) and require training all parameters of their framework. It is therefore worthwhile to explore multi-modal concepts or features that enhance performance, as well as an efficient way to incorporate visual information at minimal cost. Meanwhile, despite their powerful capabilities, multi-modal large language models (MLLMs) suffer from visual hallucination, which compromises performance. Inspired by pioneering techniques in the multi-modal field, such as prompt learning and MLLMs, this paper explores applying multi-modal prompt learning to the multi-modal machine translation task. Our framework offers three key advantages: it establishes a robust connection between visual concepts and the translation process, requires as few as 1.46M trainable parameters, and can be seamlessly integrated into any existing framework by retrieving from a multi-modal dictionary. Specifically, we propose two prompt-guided strategies: a learnable prompt-refined module and a heuristic prompt-refined module. The learnable strategy leverages off-the-shelf pre-trained models, while the heuristic strategy constrains the hallucination problem via concept retrieval. Experiments on two real-world benchmark datasets demonstrate that our proposed method outperforms all competitors.
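The abstract's "heuristic prompt-refined" idea — retrieving concepts from a multi-modal dictionary and using them to constrain the translation prompt — can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the dictionary entries, embeddings, and function names (`retrieve_concepts`, `build_prompt`) are all assumptions made up for this example.

```python
# Hypothetical sketch of retrieval-constrained prompt construction:
# rank dictionary concepts by cosine similarity to an image embedding,
# then prefix the top-k concepts to the source sentence as a constraint.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy multi-modal dictionary: concept word -> image-side embedding.
dictionary = {
    "dog":     [0.9, 0.1, 0.0],
    "frisbee": [0.7, 0.6, 0.1],
    "car":     [0.0, 0.2, 0.9],
}

def retrieve_concepts(image_embedding, k=2):
    """Return the k dictionary concepts closest to the image embedding."""
    ranked = sorted(dictionary,
                    key=lambda c: cosine(image_embedding, dictionary[c]),
                    reverse=True)
    return ranked[:k]

def build_prompt(source_sentence, image_embedding):
    """Prefix the source sentence with retrieved concepts as a soft constraint."""
    concepts = retrieve_concepts(image_embedding)
    return f"[concepts: {', '.join(concepts)}] {source_sentence}"

print(build_prompt("Ein Hund fängt eine Frisbee.", [0.8, 0.4, 0.0]))
# → [concepts: frisbee, dog] Ein Hund fängt eine Frisbee.
```

Because the retrieved concepts come from a fixed dictionary rather than a generative model, they bound what visual information can enter the prompt — the intuition behind constraining MLLM-style visual hallucination via retrieval.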
Pages: 860-868 (9 pages)