Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition

Cited by: 1005

Authors:

Fu, Jianlong [1]

Zheng, Heliang [2]

Mei, Tao [1]

Affiliations:

[1] Microsoft Research, Beijing, People's Republic of China

[2] University of Science and Technology of China, Hefei, Anhui, People's Republic of China

Source:

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017

DOI:

10.1109/CVPR.2017.476

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recognizing fine-grained categories (e.g., bird species) is difficult due to the challenges of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. The learning at each scale consists of a classification sub-network and an attention proposal sub-network (APN). The APN starts from full images, and iteratively generates region attention from coarse to fine by taking previous predictions as a reference, while a finer scale network takes as input an amplified attended region from previous scales in a recurrent way. The proposed RA-CNN is optimized by an intra-scale classification loss and an inter-scale ranking loss, to mutually learn accurate region attention and fine-grained representation. RA-CNN does not need bounding box/part annotations and can be trained end-to-end. We conduct comprehensive experiments and show that RA-CNN achieves the best performance in three fine-grained tasks, with relative accuracy gains of 3.3%, 3.7%, 3.8%, on CUB Birds, Stanford Dogs and Stanford Cars, respectively.
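
The abstract names two training signals, an intra-scale classification loss and an inter-scale ranking loss, without giving their formulas. The following is a minimal sketch of how such a pair of losses could be combined. It assumes softmax cross-entropy for classification and a margin-based hinge for ranking, so that each finer scale is pushed to be more confident in the true class than the scale before it. The function names, the margin value of 0.05, and the plain sum of the two terms are illustrative assumptions, not details taken from this record.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranking_loss(p_coarse, p_fine, margin=0.05):
    """Hinge penalty that is zero once the finer scale beats the coarser
    scale on the true class by at least `margin` (assumed form)."""
    return max(0.0, p_coarse - p_fine + margin)


def multi_scale_loss(logits_per_scale, label, margin=0.05):
    """Sum of per-scale cross-entropy and adjacent-scale ranking penalties.

    logits_per_scale: one list of class scores per scale, coarse to fine.
    label: index of the ground-truth class.
    """
    # Probability assigned to the true class at each scale.
    true_probs = [softmax(logits)[label] for logits in logits_per_scale]

    # Intra-scale term: cross-entropy at every scale.
    classification = sum(-math.log(p) for p in true_probs)

    # Inter-scale term: compare each scale with the next finer one.
    ranking = sum(
        ranking_loss(true_probs[s], true_probs[s + 1], margin)
        for s in range(len(true_probs) - 1)
    )
    return classification + ranking


# Hypothetical two-class scores at three scales, coarse to fine.
improving = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
print(multi_scale_loss(improving, label=0))
```

In this sketch the ranking term vanishes when confidence in the true class grows from coarse to fine, and becomes positive when a finer scale is less confident than the one before it, which is the behavior that would reward an attention sub-network for zooming into a more discriminative region.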

Pages: 4476-4484

Page count: 9

