Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

Cited by: 89
Authors
Beery, Sara [1, 2]
Wu, Guanhang [2]
Rathod, Vivek [2]
Votel, Ronny [2]
Huang, Jonathan [2]
Affiliations
[1] CALTECH, Pasadena, CA 91125 USA
[2] Google, Mountain View, CA 94043 USA
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
DOI
10.1109/CVPR42600.2020.01309
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds that typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and are sometimes irregular due to the use of a motion trigger. To perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long-term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3D convolution-based baseline) by 11.2% mAP.
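The abstract describes attending over a per-camera memory bank and aggregating contextual features into the current frame. The following is a minimal sketch of that general idea using plain scaled dot-product attention. It is not the authors' implementation: the function names `attend` and `add_context`, the `temperature` parameter, the absence of learned projections, and the choice to add the aggregated context back onto the query feature are all assumptions made for illustration.

```python
import math


def attend(query, memory, temperature=1.0):
    """Return (context, weights) for one query vector over a memory bank.

    query:  list of floats, a feature for one box in the current frame.
    memory: non-empty list of feature vectors of the same length, standing
            in for features stored from other frames of the same camera.
    This is generic scaled dot-product attention with no learned
    projections (an illustrative simplification).
    """
    dim = len(query)
    scale = temperature * math.sqrt(dim)
    # Similarity of the query to every memory entry.
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    # Numerically stable softmax over the memory entries.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of memory vectors is the aggregated context.
    context = [sum(w * vec[i] for w, vec in zip(weights, memory))
               for i in range(dim)]
    return context, weights


def add_context(query, memory):
    """Add the aggregated context to the query feature (assumed fusion step)."""
    context, _ = attend(query, memory)
    return [q + c for q, c in zip(query, context)]
```

Because the softmax runs over however many entries the memory bank holds, nothing in this sketch depends on the frames being evenly spaced in time, which is the property the abstract asks for under irregular sampling.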
Pages: 13072-13082
Page count: 11