Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14
Authors
Wang, Yong [1, 2, 4]
Zhang, WenKai [1, 3]
Liu, Qing [1, 3]
Zhang, Zhengyuan [1, 2, 4]
Gao, Xin [1, 3]
Sun, Xian [1, 3]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Keywords
Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;
DOI
10.1145/3394171.3413877
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is widely accepted that capturing relationships among multi-modality features helps in representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I²RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, with improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
Pages: 4190-4198
Page count: 9
Related papers
4 items in total
  • [1] Intra- and Inter-Head Orthogonal Attention for Image Captioning
    Zhang, Xiaodan
    Jia, Aozhe
    Ji, Junzhong
    Qu, Liangqiong
    Ye, Qixiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 594 - 607
  • [2] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311
  • [3] Improving Image Captioning through Visual and Semantic Mutual Promotion
    Zhang, Jing
    Xie, Yingshuai
    Liu, Xiaoqiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4716 - 4724
  • [4] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154