Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Cited by: 858
Authors
Donahue, Jeff [1]
Hendricks, Lisa Anne [1]
Rohrbach, Marcus [1,2]
Venugopalan, Subhashini [3]
Guadarrama, Sergio [1]
Saenko, Kate [4]
Darrell, Trevor [1,2]
Affiliations
[1] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[2] Int Comp Sci Inst, Berkeley, CA 94720 USA
[3] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
[4] Univ Massachusetts Lowell, Dept Comp Sci, Lowell, MA 01852 USA
Keywords
Computer vision; convolutional nets; deep learning; transfer learning
DOI
10.1109/TPAMI.2016.2599174
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.
Pages: 677 - 691
Number of pages: 15
References
73 items in total
  • [41] Natural Language Object Retrieval
    Hu, Ronghang
    Xu, Huazhe
    Rohrbach, Marcus
    Feng, Jiashi
    Saenko, Kate
    Darrell, Trevor
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 4555 - 4564
  • [42] Caffe: Convolutional Architecture for Fast Feature Embedding
    Jia, Yangqing
    Shelhamer, Evan
    Donahue, Jeff
    Karayev, Sergey
    Long, Jonathan
    Girshick, Ross
    Guadarrama, Sergio
    Darrell, Trevor
    [J]. PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 675 - 678
  • [43] Karpathy A, 2014, ADV NEUR IN, V27
  • [44] Karpathy A, 2015, PROC CVPR IEEE, P3128, DOI 10.1109/CVPR.2015.7298932
  • [45] Large-scale Video Classification with Convolutional Neural Networks
    Karpathy, Andrej
    Toderici, George
    Shetty, Sanketh
    Leung, Thomas
    Sukthankar, Rahul
    Fei-Fei, Li
    [J]. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, : 1725 - 1732
  • [46] Khan M. U. G., 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), P1480, DOI 10.1109/ICCVW.2011.6130425
  • [47] Kiros R, 2014, PR MACH LEARN RES, V32, P595
  • [48] Koehn P., 2007, ACL
  • [49] ImageNet Classification with Deep Convolutional Neural Networks
    Krizhevsky, Alex
    Sutskever, Ilya
    Hinton, Geoffrey E.
    [J]. COMMUNICATIONS OF THE ACM, 2017, 60 (06) : 84 - 90
  • [50] Kuehne H, 2011, IEEE I CONF COMP VIS, P2556, DOI 10.1109/ICCV.2011.6126543