Bottom-up and Top-down Object Inference Networks for Image Captioning

Cited: 4
Authors
Pan, Yingwei [1 ]
Li, Yehao [1 ]
Yao, Ting [1 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, 8 Beichen West St, Beijing 100105, Peoples R China
Funding
National Key R&D Program of China;
关键词
Image captioning; attention mechanism; cross-modal reasoning; LANGUAGE;
DOI
10.1145/3580366
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
A bottom-up and top-down attention mechanism has revolutionized image captioning techniques by enabling object-level attention for multi-step reasoning over all detected objects. However, when humans describe an image, they often apply their own subjective experience to focus on only a few salient objects that are worthy of mention, rather than all objects in the image. The focused objects are further arranged in linguistic order, yielding the "object sequence of interest" that composes an enriched description. In this work, we present the Bottom-up and Top-down Object inference Network (BTO-Net), which exploits the object sequence of interest as a top-down signal to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learned to produce the object sequence of interest, which acts as a top-down prior that mimics the subjective experience of humans. Next, the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent a cacophony of intermixed cross-modal signals, a contrastive learning-based objective restricts the interaction between bottom-up and top-down signals, leading to reliable and explainable cross-modal reasoning. Our BTO-Net achieves competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net.
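The abstract describes dynamically integrating bottom-up signals (detected-object features) with a top-down guidance signal via attention. The following is a minimal illustrative sketch of that fusion step, not the authors' implementation (their code is at the GitHub link above); all names, shapes, and the scaled dot-product scoring are assumptions for illustration.

```python
# Hypothetical sketch: attention over bottom-up object features,
# queried by a top-down guidance vector. Not the BTO-Net code.
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(bottom_up, top_down, W_q, W_k):
    """Score each detected-object feature (bottom-up) against the
    top-down signal; return the attention-weighted context vector
    and the attention weights over objects."""
    q = W_q @ top_down                    # query from the top-down signal
    keys = bottom_up @ W_k.T              # keys from object features
    scores = keys @ q / np.sqrt(q.size)   # scaled dot-product scores
    alpha = softmax(scores)               # one weight per detected object
    return alpha @ bottom_up, alpha       # context vector, weights

rng = np.random.default_rng(0)
n_objects, d = 5, 8
bottom_up = rng.normal(size=(n_objects, d))  # detected-object features
top_down = rng.normal(size=d)                # object-sequence-of-interest state
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

context, alpha = attend(bottom_up, top_down, W_q, W_k)
print(context.shape, alpha.shape)  # attention weights sum to 1
```

In the paper's setting, the top-down query would come from the LSTM-based object inference module at each decoding step, so the weighting over objects changes as the sentence is generated.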
Pages: 18