Intra- and Inter-Head Orthogonal Attention for Image Captioning

Cited by: 0
Authors
Zhang, Xiaodan [1 ]
Jia, Aozhe [1 ]
Ji, Junzhong [1 ]
Qu, Liangqiong [2 ]
Ye, Qixiang [3 ]
Affiliations
[1] Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
[2] Univ Hong Kong, Sch Comp & Data Sci, Hong Kong, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Head; Redundancy; Visualization; Decoding; Transformers; Feature extraction; Correlation; Accuracy; Optimization; Dogs; Image captioning; multi-head attention (MA); orthogonal constraint;
DOI
10.1109/TIP.2025.3528216
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Multi-head attention (MA), which allows a model to jointly attend to crucial information from diverse representation subspaces through its heads, has yielded remarkable achievements in image captioning. However, there is no explicit mechanism to ensure that MA attends to appropriate positions in diverse subspaces, resulting in over-focused attention within each head and redundancy between heads. In this paper, we propose a novel Intra- and Inter-Head Orthogonal Attention (I²OA) to efficiently improve MA in image captioning by introducing a concise orthogonal regularization on the heads. Specifically, Intra-Head Orthogonal Attention enhances the attention learning of MA by imposing an orthogonal constraint within each head, which decentralizes object-centric attention into more comprehensive content-aware attention. Inter-Head Orthogonal Attention reduces redundancy between heads by applying an orthogonal constraint across heads, which enlarges the diversity of representation subspaces and improves the representation ability of MA. Moreover, the proposed I²OA can be flexibly combined with various multi-head-attention-based image captioning methods and improves their performance without increasing model complexity or parameter count. Experiments on the MS COCO dataset demonstrate the effectiveness of the proposed model.
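The record does not include the paper's implementation. As a minimal sketch of the general idea (all names here are hypothetical, not the authors' code), an orthogonality penalty over a set of attention distributions can be written as the sum of squared off-diagonal cosine similarities; the same function can be applied across heads (inter-head, rows = one attention vector per head) or within a head (intra-head, rows = per-query attention distributions):

```python
import numpy as np

def orthogonality_penalty(attn: np.ndarray) -> float:
    """attn: (R, N) array of R attention distributions over N positions.

    Returns the sum of squared off-diagonal cosine similarities between
    rows; it is zero iff all rows are mutually orthogonal. Used with
    rows = heads this sketches an inter-head constraint; used with
    rows = per-query distributions of one head, an intra-head one.
    """
    # L2-normalize each row so the Gram matrix holds cosine similarities
    norms = np.linalg.norm(attn, axis=1, keepdims=True)
    unit = attn / np.clip(norms, 1e-8, None)
    gram = unit @ unit.T                      # (R, R) cosine similarities
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarities
    return float(np.sum(off_diag ** 2))
```

In training, a term like this would be added to the captioning loss with a small weight, steering heads (or query positions within a head) toward non-overlapping attention patterns without adding parameters, consistent with the abstract's claim of no extra model complexity.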
Pages: 594-607
Page count: 14