Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts

Cited by: 126

Authors:

Fu, Kun [1,2]

Jin, Junqi [1,2]

Cui, Runpeng [2]

Sha, Fei [3,4]

Zhang, Changshui [2]

Affiliations:

[1] Univ Southern Calif, Los Angeles, CA 90089 USA

[2] Tsinghua Univ, Dept Automat, State Key Lab Intelligence Technol & Syst, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China

[3] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90085 USA

[4] Univ Southern Calif, Dept Comp Sci, Los Angeles, CA 90089 USA

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2017 / Vol. 39 / Issue 12

Funding:

US National Science Foundation;

Keywords:

Image captioning; visual attention; scene-specific context; LSTM; GRADIENTS;

DOI:

10.1109/TPAMI.2016.2642953

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience in which attention shifts among the visual regions; such transitions impose a thread of ordering in visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word generation to specific scene types. We benchmark our system and contrast it with published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that either region-based attention or scene-specific contexts improves over systems without those components. Furthermore, combining these two modeling ingredients attains state-of-the-art performance.

Pages: 2321-2334

Page count: 14

References: 41 in total
