Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14

Authors:

Wang, Yong [1,2,4]

Zhang, WenKai [1,3]

Liu, Qing [1,3]

Zhang, Zhengyuan [1,2,4]

Gao, Xin [1,3]

Sun, Xian [1,3]

Affiliations:

[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China

[4] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020

Keywords:

Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;

DOI:

10.1145/3394171.3413877

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

It is widely accepted that capturing relationships among multi-modality features helps in representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I²RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.

Citation

Pages: 4190-4198

Page count: 9

4 items in total

[1] Intra- and Inter-Head Orthogonal Attention for Image Captioning
Zhang, Xiaodan
Jia, Aozhe
Ji, Junzhong
Qu, Liangqiong
Ye, Qixiang
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 594 - 607
[2] Improving Visual Question Answering by Image Captioning
Shao, Xiangjun
Dong, Hongsong
Wu, Guangsheng
IEEE ACCESS, 2025, 13 : 46299 - 46311
[3] Improving Image Captioning through Visual and Semantic Mutual Promotion
Zhang, Jing
Xie, Yingshuai
Liu, Xiaoqiang
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4716 - 4724
[4] Relation-Aware Image Captioning for Explainable Visual Question Answering
Tseng, Ching-Shan
Lin, Ying-Jia
Kao, Hung-Yu
2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154