Modal Contrastive Learning Based End-to-End Text Image Machine Translation

Cited by: 0
Authors
Ma, Cong [1,2]
Han, Xu [1,2]
Wu, Linghui [1,2]
Zhang, Yaping [1,2]
Zhao, Yang [1,2]
Zhou, Yu [1,2]
Zong, Chengqing [1,2]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Machine translation; Decoding; Semantics; Pipelines; Text recognition; Task analysis; Text image machine translation; contrastive learning; text image recognition; machine translation; RECOGNITION;
DOI
10.1109/TASLP.2023.3324540
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Text image machine translation (TIMT) aims to translate source-language text embedded in images directly into the target language. Most existing systems follow a cascaded pipeline from recognition to translation, which suffers from error propagation, parameter redundancy, and information reduction. An end-to-end model has the potential to alleviate these issues by bridging the recognition and translation models. However, it faces two challenges: limited data and the modality gap between text and image. In this paper, we propose a novel end-to-end model, namely Modal contrastive learning based End-to-end Text Image Machine Translation (METIMT), which addresses these challenges through an end-to-end text image machine translation architecture and modal contrastive learning. Specifically, an image encoder is designed to encode images into the same feature space as the corresponding text sentences, guided by an intra-modal and inter-modal contrastive learning module. To further promote research on text image machine translation, we have constructed one synthetic and two real-world datasets. Extensive experiments show that our lighter, faster model outperforms not only existing pipeline methods but also state-of-the-art end-to-end models on both synthetic and real-world evaluation sets. Our code and dataset will be released to the public.
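The abstract does not give the form of the contrastive objective. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss, a common choice for pulling paired image and text features into a shared space. It is not the METIMT implementation. The function names (`info_nce`, `inter_modal_loss`, `intra_modal_loss`), the temperature value, and the use of two views of the same sample for the intra-modal term are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    candidates[i] is treated as the positive for anchors[i]; every other
    candidate in the batch acts as a negative. The temperature is an
    illustrative default, not a value taken from the paper.
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def inter_modal_loss(image_feats, text_feats, temperature=0.1):
    """Symmetric image-to-text and text-to-image contrastive loss (assumed form)."""
    return 0.5 * (
        info_nce(image_feats, text_feats, temperature)
        + info_nce(text_feats, image_feats, temperature)
    )


def intra_modal_loss(view_a, view_b, temperature=0.1):
    """Contrast two views of the same modality (assumed form of the intra-modal term)."""
    return info_nce(view_a, view_b, temperature)
```

With toy two-dimensional features, a batch whose image and text features are correctly paired yields a much lower `inter_modal_loss` than the same batch with the text features swapped. This is the behavior such an objective is meant to encourage.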
Pages: 2153-2165
Number of pages: 13