Integrating Scene Semantic Knowledge into Image Captioning

Cited: 45
Authors
Wei, Haiyang [1]
Li, Zhixin [1]
Huang, Feicheng [1]
Zhang, Canlong [1]
Ma, Huifang [2]
Shi, Zhongzhi [3]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, 15 Yucai Rd, Guilin 541004, Guangxi, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, 967 Anning East Rd, Lanzhou 730070, Gansu, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, 6 Kexueyuan South Rd, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; attention mechanism; scene semantics; encoder-decoder framework;
DOI
10.1145/3439734
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Most existing image captioning methods use only the visual information of the image to guide caption generation and lack the guidance of effective scene semantic information; moreover, current visual attention mechanisms cannot adjust their focus intensity on the image. In this article, we first propose an improved visual attention model. At each timestep, we calculate a focus intensity coefficient for the attention mechanism from the context information of the model and then use this coefficient to automatically adjust the focus intensity of the attention mechanism, extracting more accurate visual information. In addition, we represent the scene semantic knowledge of the image through topic words related to the image scene and add them to the language model. We use the attention mechanism to determine the visual information and scene semantic information the model attends to at each timestep and combine them, enabling the model to generate more accurate and scene-specific captions. Finally, we evaluate our model on the Microsoft COCO (MSCOCO) and Flickr30k standard datasets. The experimental results show that our approach generates more accurate captions and outperforms many recent advanced models on various evaluation metrics.
Pages: 22
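The record gives no equations, but the mechanism the abstract describes, a focus-intensity coefficient computed from the decoder context that sharpens or flattens the visual attention distribution, plus a parallel attention over topic-word embeddings that supplies scene semantics, can be sketched as a temperature-scaled softmax with gated fusion. The NumPy sketch below is one reading of the abstract under those assumptions, not the authors' implementation; every name in it (`beta`, `attend`, the sigmoid gating forms) is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_t, features, W, beta=1.0):
    """Bilinear attention whose sharpness is controlled by a focus-intensity
    coefficient beta: beta > 1 concentrates weight on the highest-scoring
    features, beta < 1 spreads it. A hypothetical reading, not the paper's formula."""
    scores = features @ (W @ h_t)      # (k,) relevance of each feature to h_t
    weights = softmax(beta * scores)   # focus intensity rescales the logits
    return weights @ features          # (d,) attention-weighted context vector

rng = np.random.default_rng(0)
k, m, d, h = 36, 10, 8, 6              # regions, topic words, feature dim, hidden dim
V = rng.normal(size=(k, d))            # image-region features (e.g., CNN outputs)
T = rng.normal(size=(m, d))            # topic-word embeddings (scene semantics)
h_t = rng.normal(size=h)               # decoder hidden state at timestep t
Wv = rng.normal(size=(d, h))           # visual-attention projection (toy weights)
Ws = rng.normal(size=(d, h))           # semantic-attention projection (toy weights)

# Focus-intensity coefficient predicted from the decoder context; keeping
# beta in (1, 2) via a sigmoid-gated linear map is an assumption.
w_beta = rng.normal(size=h)
beta = 1.0 + 1.0 / (1.0 + np.exp(-(w_beta @ h_t)))

v_ctx = attend(h_t, V, Wv, beta)       # visual context, adjustable focus
s_ctx = attend(h_t, T, Ws)             # scene-semantic context, plain softmax

# Scalar gate fusing the two contexts (also an assumption); the fused
# vector would feed the language-model decoder to predict the next word.
w_g = rng.normal(size=d)
g = 1.0 / (1.0 + np.exp(-(w_g @ (v_ctx + s_ctx))))
fused = g * v_ctx + (1.0 - g) * s_ctx
print(fused.shape)                     # (8,)
```

The point of the sketch is the role of `beta`: multiplying the attention logits by a context-dependent coefficient lets the decoder choose between concentrating on a few image regions and averaging broadly, which matches the abstract's claim of an adjustable focus intensity.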