Conditionally Learn to Pay Attention for Sequential Visual Task

Cited by: 0
Authors
He, Jun [1 ]
Cao, Quan-Jie [1 ]
Zhang, Lei [2 ]
Tao, Hui [1 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, Nanjing 210044, Peoples R China
[2] Nanjing Normal Univ, Sch Elect & Automat Engn, Nanjing 210023, Peoples R China
Source
IEEE ACCESS | 2020, Vol. 8
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Object recognition; Object segmentation; Feature extraction; Adaptation models; Convolutional neural networks; Attention learning; weakly supervised learning; multiple objects recognition; image caption
DOI
10.1109/ACCESS.2020.2982571
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
A sequential visual task usually requires attending to the object of current interest conditioned on previous observations. Unlike the popular soft attention mechanism, we propose a new attention framework that introduces a novel conditional global feature, a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields produce attention maps by measuring how well the convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task: a simple recurrent neural network for multiple-object recognition, or a moderately complex language model for image captioning and weakly supervised multiple-object segmentation. Experiments show that our conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with and without extra bounding boxes; for image captioning, it scores better than the popular soft attention model; and for weakly supervised multiple-object segmentation, given only a descriptive sentence per image, it segments the salient regions corresponding to the meaningful noun words.
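To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of one conditional attention step: each spatial location of a CNN feature map is scored by its alignment with the conditional global feature, and the scores are normalized into an attention map. The 1x1 projection, the dot-product alignment score, and all layer sizes are illustrative assumptions, not the authors' published architecture; the paper additionally attends over layers with different receptive fields, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAttention(nn.Module):
    """Sketch of the conditional attention idea: score each spatial
    location of a CNN feature map by how well it aligns with a
    'conditional global feature' (a weak descriptor of the currently
    focused object), then softmax the scores into an attention map."""

    def __init__(self, conv_channels: int, cond_dim: int):
        super().__init__()
        # A 1x1 conv projects conv features into the conditional feature
        # space so a dot product is a meaningful alignment score
        # (the projection and scoring rule are assumptions here).
        self.proj = nn.Conv2d(conv_channels, cond_dim, kernel_size=1)

    def forward(self, conv_feats: torch.Tensor, cond_global: torch.Tensor):
        # conv_feats:  (B, C, H, W), features from one CNN layer
        # cond_global: (B, D), emitted at each step by a recurrent model
        #              conditioned on previous observations (e.g. an LSTM)
        keys = self.proj(conv_feats)                          # (B, D, H, W)
        B, D, H, W = keys.shape
        keys = keys.flatten(2)                                # (B, D, H*W)
        # Alignment of every location with the conditional feature.
        scores = torch.bmm(cond_global.unsqueeze(1), keys)    # (B, 1, H*W)
        attn = F.softmax(scores, dim=-1)                      # attention map
        # Attention-weighted context vector for the current step.
        context = torch.bmm(attn, keys.transpose(1, 2)).squeeze(1)  # (B, D)
        return context, attn.view(B, H, W)

if __name__ == "__main__":
    # One step of a sequential task; a fresh conditional feature would
    # be produced by the recurrent model at every subsequent step.
    attend = ConditionalAttention(conv_channels=512, cond_dim=256)
    feats = torch.randn(2, 512, 14, 14)
    cond = torch.randn(2, 256)
    context, attn_map = attend(feats, cond)
    print(context.shape, attn_map.shape)  # (2, 256) and (2, 14, 14)
```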
Pages: 56695 - 56710
Page count: 16