Bottom-up and Top-down Object Inference Networks for Image Captioning

Cited: 4
Authors
Pan, Yingwei [1 ]
Li, Yehao [1 ]
Yao, Ting [1 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, 8 Beichen West St, Beijing 100105, Peoples R China
Funding
National Key R&D Program of China;
关键词
Image captioning; attention mechanism; cross-modal reasoning; LANGUAGE;
DOI
10.1145/3580366
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
A bottom-up and top-down attention mechanism has revolutionized image captioning techniques by enabling object-level attention for multi-step reasoning over all detected objects. However, when humans describe an image, they often apply their own subjective experience to focus on only a few salient objects that are worthy of mention, rather than all objects in the image. The focused objects are further arranged in linguistic order, yielding the "object sequence of interest" that composes an enriched description. In this work, we present the Bottom-up and Top-down Object inference Network (BTO-Net), which exploits the object sequence of interest as a top-down signal to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learned to produce the object sequence of interest, which acts as a top-down prior that mimics the subjective experience of humans. Next, the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent a cacophony of intermixed cross-modal signals, a contrastive learning-based objective restricts the interaction between bottom-up and top-down signals, leading to reliable and explainable cross-modal reasoning. Our BTO-Net achieves competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net.
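The abstract describes dynamically integrating bottom-up signals (detected-object features) with a top-down guidance signal via attention. The following is a minimal illustrative sketch of that fusion step, not the authors' implementation (their code is at the GitHub link above); all names, shapes, and the scaled dot-product scoring are assumptions for illustration.

```python
# Hypothetical sketch: attention over bottom-up object features,
# queried by a top-down guidance vector. Not the BTO-Net code.
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(bottom_up, top_down, W_q, W_k):
    """Score each detected-object feature (bottom-up) against the
    top-down signal; return the attention-weighted context vector
    and the attention weights over objects."""
    q = W_q @ top_down                    # query from the top-down signal
    keys = bottom_up @ W_k.T              # keys from object features
    scores = keys @ q / np.sqrt(q.size)   # scaled dot-product scores
    alpha = softmax(scores)               # one weight per detected object
    return alpha @ bottom_up, alpha       # context vector, weights

rng = np.random.default_rng(0)
n_objects, d = 5, 8
bottom_up = rng.normal(size=(n_objects, d))  # detected-object features
top_down = rng.normal(size=d)                # object-sequence-of-interest state
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

context, alpha = attend(bottom_up, top_down, W_q, W_k)
print(context.shape, alpha.shape)  # attention weights sum to 1
```

In the paper's setting, the top-down query would come from the LSTM-based object inference module at each decoding step, so the weighting over objects changes as the sentence is generated.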
Pages: 18