Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning

Cited: 1
Authors
Song, Zijie [1 ]
Hu, Zhenzhen [1 ]
Zhou, Yuanen [2 ]
Zhao, Ye [1 ]
Hong, Richang [1 ]
Wang, Meng [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
Keywords
Visualization; Task analysis; Transformers; Tensors; Semantics; Computational modeling; Cognition; Image captioning; cross-lingual learning; cross-modal learning; heterogeneous attention reasoning; NETWORK
DOI
10.1109/TMM.2024.3384678
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles in multimedia analysis. The crucial issue in this task is to model the global and local matching between images and different languages. Existing cross-modal embedding methods based on the transformer architecture overlook the local matching between image regions and monolingual words, especially when dealing with diverse languages. To overcome these limitations, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to establish cross-domain relationships and local correspondences between images and different languages through a heterogeneous network. EHAT comprises Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). HARN serves as the core network: it captures cross-domain relationships by leveraging visual bounding-box representation features to connect word features from the two languages and to learn heterogeneous maps. MHCA and HCA facilitate cross-domain integration in the encoder through specialized heterogeneous attention mechanisms, enabling a single model to generate captions in two languages. We evaluate our approach on the MSCOCO dataset by generating captions in English and Chinese, two languages that belong to markedly different language families. The experimental results demonstrate the superior performance of our method compared to existing advanced monolingual methods. The proposed EHAT framework effectively addresses the challenges of cross-lingual image captioning, paving the way for improved multilingual image analysis and understanding.
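The abstract describes visual bounding-box features acting as shared anchors that connect word features from two languages via cross-attention. The snippet below is a minimal illustrative sketch of that idea only; the array shapes, the `cross_attention` helper, and the anchoring scheme are assumptions made for illustration and do not reproduce the authors' actual EHAT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values, mask=None):
    """Scaled dot-product cross-attention.

    An optional boolean mask (True = allowed pair) can restrict which
    query-key pairs interact, in the spirit of a masked heterogeneous
    cross-attention.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
regions = rng.standard_normal((5, d))    # visual bounding-box features (illustrative)
en_words = rng.standard_normal((7, d))   # English word features (illustrative)
zh_words = rng.standard_normal((6, d))   # Chinese word features (illustrative)

# Visual regions act as anchors: word features from each language attend
# to the same image regions, yielding visually grounded representations
# that are aligned through the shared visual space.
en_grounded = cross_attention(en_words, regions, regions)
zh_grounded = cross_attention(zh_words, regions, regions)
```

In this toy setting, both languages are projected through the same visual key/value space, which is one simple way a single model could couple two decoders to one image representation.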
Pages: 9008 - 9020
Page count: 13