Conditionally Learn to Pay Attention for Sequential Visual Task

Cited by: 0
Authors
He, Jun [1 ]
Cao, Quan-Jie [1 ]
Zhang, Lei [2 ]
Tao, Hui [1 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, Nanjing 210044, Peoples R China
[2] Nanjing Normal Univ, Sch Elect & Automat Engn, Nanjing 210023, Peoples R China
Source
IEEE ACCESS | 2020, Vol. 8
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Object recognition; Object segmentation; Feature extraction; Adaptation models; Convolutional neural networks; Attention learning; weakly supervised learning; multiple objects recognition; image caption
DOI
10.1109/ACCESS.2020.2982571
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
A sequential visual task usually requires attending to the object of current interest conditioned on previous observations. Unlike the popular soft attention mechanism, we propose a new attention framework that introduces a novel conditional global feature, a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields produce attention maps by measuring how well the convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task: a simple recurrent neural network for multiple-object recognition, or a moderately complex language model for image captioning and weakly supervised multiple-object segmentation. Experiments show that our conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with and without extra bounding boxes; for image captioning, it scores better than the popular soft attention model; and for weakly supervised multiple-object segmentation, given only a descriptive sentence per image, it segments the salient regions corresponding to the meaningful noun words.
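To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of one conditional attention step: each spatial location of a CNN feature map is scored by its alignment with the conditional global feature, and the scores are normalized into an attention map. The 1x1 projection, the dot-product alignment score, and all layer sizes are illustrative assumptions, not the authors' published architecture; the paper additionally attends over layers with different receptive fields, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAttention(nn.Module):
    """Sketch of the conditional attention idea: score each spatial
    location of a CNN feature map by how well it aligns with a
    'conditional global feature' (a weak descriptor of the currently
    focused object), then softmax the scores into an attention map."""

    def __init__(self, conv_channels: int, cond_dim: int):
        super().__init__()
        # A 1x1 conv projects conv features into the conditional feature
        # space so a dot product is a meaningful alignment score
        # (the projection and scoring rule are assumptions here).
        self.proj = nn.Conv2d(conv_channels, cond_dim, kernel_size=1)

    def forward(self, conv_feats: torch.Tensor, cond_global: torch.Tensor):
        # conv_feats:  (B, C, H, W), features from one CNN layer
        # cond_global: (B, D), emitted at each step by a recurrent model
        #              conditioned on previous observations (e.g. an LSTM)
        keys = self.proj(conv_feats)                          # (B, D, H, W)
        B, D, H, W = keys.shape
        keys = keys.flatten(2)                                # (B, D, H*W)
        # Alignment of every location with the conditional feature.
        scores = torch.bmm(cond_global.unsqueeze(1), keys)    # (B, 1, H*W)
        attn = F.softmax(scores, dim=-1)                      # attention map
        # Attention-weighted context vector for the current step.
        context = torch.bmm(attn, keys.transpose(1, 2)).squeeze(1)  # (B, D)
        return context, attn.view(B, H, W)

if __name__ == "__main__":
    # One step of a sequential task; a fresh conditional feature would
    # be produced by the recurrent model at every subsequent step.
    attend = ConditionalAttention(conv_channels=512, cond_dim=256)
    feats = torch.randn(2, 512, 14, 14)
    cond = torch.randn(2, 256)
    context, attn_map = attend(feats, cond)
    print(context.shape, attn_map.shape)  # (2, 256) and (2, 14, 14)
```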
Pages: 56695 - 56710
Page count: 16