Object-aware semantics of attention for image captioning

Cited by: 1
Authors
Shiwei Wang
Long Lan
Xiang Zhang
Guohua Dong
Zhigang Luo
Affiliations
[1] National University of Defense Technology, Science and Technology on Parallel and Distributed Processing
[2] National University of Defense Technology, Institute for Quantum Information, State Key Laboratory of High Performance Computing
[3] National University of Defense Technology, College of Computer
Source
Multimedia Tools and Applications | 2020, Vol. 79
Keywords
High-level semantic concepts; Semantic attention; Image captioning
DOI
Not available
Abstract
In image captioning, exploiting high-level semantic concepts is important for boosting captioning performance. Although much progress has been made in this regard, most existing image captioning models neglect the interrelationships between objects in an image, which are a key factor in accurately understanding image content. In this paper, we propose an object-aware semantic attention (OSA) based captioning model to address this issue. Specifically, our attention model captures explicit associations between objects by coupling the attention mechanism with three types of semantic concepts: category information, the relative sizes of objects, and the relative distances between objects. In practice, these concepts are easy to build and couple seamlessly with the well-known encoder-decoder captioning framework. In our empirical analysis, the three concepts capture different aspects of image content, such as the number of objects in each category, the main focus of an image, and the closeness between objects. Importantly, they cooperate with visual features to help the attention model effectively highlight the image regions of interest, yielding significant performance gains. By leveraging the three types of semantic concepts, we derive four semantic attention models for image captioning. Extensive experiments on the MSCOCO dataset show that our attention models, used within the encoder-decoder image captioning framework, perform favorably against representative captioning models.
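For illustration only, a minimal sketch of how the three semantic cues named in the abstract (category, relative size, relative distance) might be fused with region features in an additive attention layer is shown below. The module name, the embedding choices, and the scoring scheme are assumptions made for this example; they are not the authors' published formulation, which derives four distinct attention variants in the paper.

```python
import torch
import torch.nn as nn

class ObjectAwareAttention(nn.Module):
    """Hypothetical reconstruction of object-aware semantic attention:
    additive attention over object regions, conditioned on category
    embeddings plus relative-size and relative-distance cues."""

    def __init__(self, feat_dim, hidden_dim, num_categories, sem_dim=32):
        super().__init__()
        self.cat_embed = nn.Embedding(num_categories, sem_dim)  # category cue
        self.geo_proj = nn.Linear(2, sem_dim)                   # size + distance cues
        self.w_v = nn.Linear(feat_dim, hidden_dim)              # visual features
        self.w_h = nn.Linear(hidden_dim, hidden_dim)            # decoder state
        self.w_s = nn.Linear(2 * sem_dim, hidden_dim)           # semantic concepts
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, cats, rel_size, rel_dist, h):
        # feats:    (B, N, feat_dim)  region features from the encoder
        # cats:     (B, N)            detected object category ids
        # rel_size: (B, N)            region area relative to the image
        # rel_dist: (B, N)            mean distance to the other regions
        # h:        (B, hidden_dim)   current decoder hidden state
        geo = self.geo_proj(torch.stack([rel_size, rel_dist], dim=-1))
        sem = torch.cat([self.cat_embed(cats), geo], dim=-1)
        e = self.score(torch.tanh(self.w_v(feats)
                                  + self.w_s(sem)
                                  + self.w_h(h).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(e, dim=1)                  # weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # attended visual context
        return context, alpha

# Usage sketch: 36 detected regions, 80 object categories (MSCOCO-style).
att = ObjectAwareAttention(feat_dim=2048, hidden_dim=512, num_categories=80)
ctx, alpha = att(torch.randn(4, 36, 2048), torch.randint(80, (4, 36)),
                 torch.rand(4, 36), torch.rand(4, 36), torch.randn(4, 512))
```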
Pages: 2013-2030
Number of pages: 17