Position-aware image captioning with spatial relation

Cited by: 6
Authors
Duan, Yiqun [1]
Wang, Zhen [2]
Wang, Jingya [3]
Wang, Yu-Kai [1]
Lin, Chin-Teng [1]
Affiliations
[1] Univ Technol Sydney, Australian Artificial Intelligence Inst, Sch Comp Sci, CIBCI Lab, Ultimo, NSW 2007, Australia
[2] Univ Sydney, Sch Comp Sci, Darlington, NSW 2008, Australia
[3] ShanghaiTech Univ, Shanghai Engn Res Ctr Intelligent Vis & Imaging, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
Keywords
Deep learning; Vision & Language; Neural networks; Language generations; Transformer; Spatial relations;
DOI
10.1016/j.neucom.2022.05.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Image captioning aims to generate a natural-language description of a given image. The problem can be addressed by learning semantic information about visual objects and generating descriptions from the extracted embeddings. However, existing methods do not fully explore the spatial relationships between visual objects or their static positions. In this work, we propose a Position-Aware Transformer (PAT) model that extracts both regional and static global visual features and unifies them by incorporating spatial information aligned to each visual feature. To better represent spatial information and the correlation between extracted visual features, we propose and compare three approaches that explicitly encode spatial relation information in the position embedding. Moreover, we jointly consider the static global and regional embeddings for spatial modeling. Experimental results show that the proposed model achieves competitive performance on the COCO image captioning dataset, where the PAT model reaches 38.7, 28.6, and 58.6 on BLEU-4, METEOR, and ROUGE-L, respectively. Extensive experiments suggest that the PAT model also reaches competitive performance on related vision-language tasks, including visual question answering (VQA) and multimodal retrieval. Detailed ablation studies report how each component contributes to the final performance, which could serve as a reference for follow-up work on spatial information representation. © 2022 Published by Elsevier B.V.
Pages: 28-38
Page count: 11
Related papers
50 in total
  • [31] Position-Aware Tagging for Aspect Sentiment Triplet Extraction
    Xu, Lu
    Li, Hao
    Lu, Wei
    Bing, Lidong
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2339 - 2349
  • [32] PAS: A Position-Aware Similarity Measurement for Sequential Recommendation
    Zeng, Zijie
    Lin, Jing
    Pan, Weike
    Ming, Zhong
    Lu, Zhongqi
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [33] Position-Aware Relational Transformer for Knowledge Graph Embedding
    Li, Guangyao
    Sun, Zequn
    Hu, Wei
    Cheng, Gong
    Qu, Yuzhong
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (08) : 11580 - 11594
  • [34] A POSITION-AWARE LINEAR SOLID CONSTITUTIVE MODEL FOR PERIDYNAMICS
    Mitchell, John A.
    Silling, Stewart A.
    Littlewood, David J.
    JOURNAL OF MECHANICS OF MATERIALS AND STRUCTURES, 2015, 10 (05) : 539 - 557
  • [35] ISDA: POSITION-AWARE INSTANCE SEGMENTATION WITH DEFORMABLE ATTENTION
    Ying, Kaining
    Wang, Zhenhua
    Bai, Cong
    Zhou, Pengfei
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 2619 - 2623
  • [36] Position-aware compositional embeddings for compressed recommendation systems
    Mu, Zongshen
    Zhuang, Yueting
    Tang, Siliang
    NEUROCOMPUTING, 2024, 592
  • [37] Position-Aware ListMLE: A Sequential Learning Process for Ranking
    Lan, Yanyan
    Zhu, Yadong
    Guo, Jiafeng
    Niu, Shuzi
    Cheng, Xueqi
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2014, : 449 - 458
  • [38] Uncertainty-Aware Image Captioning
    Fei, Zhengcong
    Fan, Mingyuan
    Zhu, Li
    Huang, Junshi
    Wei, Xiaoming
    Wei, Xiaolin
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 614 - 622
  • [39] Culturally-aware Image Captioning
    Yun, Youngsik
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 8520 - 8521
  • [40] Exploring Implicit and Explicit Relations with the Dual Relation-Aware Network for Image Captioning
    Zha, Zhiwei
    Zhou, Pengfei
    Bai, Cong
    MULTIMEDIA MODELING, MMM 2022, PT II, 2022, 13142 : 97 - 108