Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Cited by: 216
Authors
Gao, Lianli [1,2]
Li, Xiangpeng [1,2]
Song, Jingkuan [1,2]
Shen, Heng Tao [1,2]
Affiliations
[1] Univ Elect Sci & Technol China, Future Media Ctr, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Feature extraction; Task analysis; Decoding; Adaptation models; Natural language processing; Video captioning; Image captioning; Adaptive attention; Hierarchical structure; Language; Vision; Video
DOI
10.1109/TPAMI.2019.2894139
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the" and "a"). However, these non-visual words can be easily predicted using a natural language model, without considering visual signals or attention. Imposing an attention mechanism on non-visual words could mislead the decoder and decrease the overall performance of visual captioning. Furthermore, a hierarchy of LSTMs enables a more complex representation of visual data, capturing information at different scales. To address these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention to select specific regions or frames for predicting the related words, while adaptive attention decides whether to depend on the visual information or the language context information. In addition, a hierarchy of LSTMs is designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design hLSTMat as a general framework and first instantiate it for the task of video captioning. Then, we further refine the model and apply it to the image captioning task. To demonstrate the effectiveness of the proposed framework, we evaluate our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most of the evaluation metrics for both tasks. The effect of important components is also analyzed in an ablation study.
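The abstract describes two coupled mechanisms: spatial/temporal attention that scores regions or frames, and an adaptive gate that decides whether the next word should draw on visual evidence or on the language context. As a rough illustration of the second mechanism, below is a minimal PyTorch sketch of one adaptive-attention step using a visual sentinel; the class name, dimensions, and exact gating formulation are assumptions for exposition, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Sketch of adaptive attention with a visual sentinel (illustrative).

    V: (B, k, d) region/frame features; h: (B, d) decoder hidden state;
    s: (B, d) sentinel vector distilled from the decoder's memory.
    The gate beta decides how much the context relies on the sentinel
    (language information) versus the attended visual features.
    """
    def __init__(self, d, att_dim=512):
        super().__init__()
        self.w_v = nn.Linear(d, att_dim)  # project visual features
        self.w_h = nn.Linear(d, att_dim)  # project hidden state
        self.w_s = nn.Linear(d, att_dim)  # project sentinel
        self.w_a = nn.Linear(att_dim, 1)  # scalar attention scores

    def forward(self, V, h, s):
        # scores over the k visual regions/frames: (B, k)
        z = self.w_a(torch.tanh(self.w_v(V) + self.w_h(h).unsqueeze(1))).squeeze(-1)
        # extra score for the sentinel, acting as a (k+1)-th slot: (B, 1)
        z_s = self.w_a(torch.tanh(self.w_s(s) + self.w_h(h)))
        alpha = F.softmax(torch.cat([z, z_s], dim=1), dim=1)  # (B, k+1)
        # visual context: weighted sum of region features
        c = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)      # (B, d)
        beta = alpha[:, -1:]                                  # weight on sentinel
        # adaptive context: blend language (sentinel) and visual evidence
        c_hat = beta * s + (1.0 - beta) * c
        return c_hat, alpha[:, :-1], beta

# toy usage: batch of 2, 36 regions with 512-d features (dimensions assumed)
att = AdaptiveAttention(d=512)
V, h, s = torch.randn(2, 36, 512), torch.randn(2, 512), torch.randn(2, 512)
c_hat, alpha, beta = att(V, h, s)  # beta near 1 => rely on language context
```

Folding the sentinel into the softmax as an extra slot lets the gate beta compete directly with the region weights, so a beta near 1 signals a non-visual word (e.g., "the") that can be predicted from the language context alone, without attending to the image or video.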
Pages: 1112-1131
Page count: 20