Position-aware image captioning with spatial relation

Cited by: 6

Authors:

Duan, Yiqun ^{[1]}

Wang, Zhen ^{[2]}

Wang, Jingya ^{[3]}

Wang, Yu-Kai ^{[1]}

Lin, Chin-Teng ^{[1]}

Affiliations:

[1] Univ Technol Sydney, Australian Artificial Intelligence Inst, Sch Comp Sci, CIBCI Lab, Ultimo, NSW 2007, Australia

[2] Univ Sydney, Sch Comp Sci, Darlington, NSW 2008, Australia

[3] ShanghaiTech Univ, Shanghai Engn Res Ctr Intelligent Vis & Imaging, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China

Source:

NEUROCOMPUTING | 2022 / Vol. 497

Keywords:

Deep learning; Vision & Language; Neural networks; Language generations; Transformer; Spatial relations;

DOI:

10.1016/j.neucom.2022.05.003

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Image captioning aims to generate a natural-language description of a given image. The problem can be addressed by learning semantic information about visual objects and generating descriptions from the extracted embeddings. However, existing methods do not fully explore the spatial relationships between visual objects or their static positions. In this work, we propose a Position-Aware Transformer (PAT) model that extracts both regional and static global visual features and unifies them by incorporating spatial information aligned to each visual feature. To better represent spatial information and the correlation between extracted visual features, we propose and compare three approaches that explicitly build position embeddings with spatial relation information. Moreover, we jointly consider the static global and regional embeddings for spatial modeling. Experimental results show that the proposed model achieves competitive performance on the COCO image captioning dataset, reaching 38.7, 28.6, and 58.6 on BLEU-4, METEOR, and ROUGE-L, respectively. Extensive experiments suggest that the PAT model also reaches competitive performance on related vision-language tasks, including visual question answering (VQA) and multimodal retrieval. Detailed ablation studies report how each component contributes to the final performance, which may serve as a reference for follow-up work on spatial information representation. © 2022 Published by Elsevier B.V.


Pages: 28-38

Page count: 11

