Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning

Cited: 1
Authors
Song, Zijie [1 ]
Hu, Zhenzhen [1 ]
Zhou, Yuanen [2 ]
Zhao, Ye [1 ]
Hong, Richang [1 ]
Wang, Meng [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
Keywords
Visualization; Task analysis; Transformers; Tensors; Semantics; Computational modeling; Cognition; Image captioning; cross-lingual learning; cross-modal learning; heterogeneous attention reasoning; NETWORK
DOI
10.1109/TMM.2024.3384678
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles in multimedia analysis. The crucial issue in this task is to model the global and local matching between images and different languages. Existing cross-modal embedding methods based on the transformer architecture overlook the local matching between image regions and monolingual words, especially when dealing with diverse languages. To overcome these limitations, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to establish cross-domain relationships and local correspondences between images and different languages through a heterogeneous network. EHAT comprises Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). HARN serves as the core network: it captures cross-domain relationships by leveraging visual bounding-box representation features to connect word features from the two languages and to learn heterogeneous maps. MHCA and HCA facilitate cross-domain integration in the encoder through specialized heterogeneous attention mechanisms, enabling a single model to generate captions in two languages. We evaluate our approach on the MSCOCO dataset by generating captions in English and Chinese, two languages that belong to markedly different language families. The experimental results demonstrate the superior performance of our method compared to existing advanced monolingual methods. The proposed EHAT framework effectively addresses the challenges of cross-lingual image captioning, paving the way for improved multilingual image analysis and understanding.
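The abstract describes visual bounding-box features acting as shared anchors that connect word features from two languages via cross-attention. The snippet below is a minimal illustrative sketch of that idea only; the array shapes, the `cross_attention` helper, and the anchoring scheme are assumptions made for illustration and do not reproduce the authors' actual EHAT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values, mask=None):
    """Scaled dot-product cross-attention.

    An optional boolean mask (True = allowed pair) can restrict which
    query-key pairs interact, in the spirit of a masked heterogeneous
    cross-attention.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
regions = rng.standard_normal((5, d))    # visual bounding-box features (illustrative)
en_words = rng.standard_normal((7, d))   # English word features (illustrative)
zh_words = rng.standard_normal((6, d))   # Chinese word features (illustrative)

# Visual regions act as anchors: word features from each language attend
# to the same image regions, yielding visually grounded representations
# that are aligned through the shared visual space.
en_grounded = cross_attention(en_words, regions, regions)
zh_grounded = cross_attention(zh_words, regions, regions)
```

In this toy setting, both languages are projected through the same visual key/value space, which is one simple way a single model could couple two decoders to one image representation.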
Pages: 9008 - 9020
Page count: 13