Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

Cited by: 89
Authors
Beery, Sara [1,2]
Wu, Guanhang [2]
Rathod, Vivek [2]
Votel, Ronny [2]
Huang, Jonathan [2]
Affiliations
[1] Caltech, Pasadena, CA 91125 USA
[2] Google, Mountain View, CA 94043 USA
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
DOI
10.1109/CVPR42600.2020.01309
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes irregular due to the use of a motion trigger. To perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3D-convolution-based baseline) by 11.2% mAP.
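The abstract's core mechanism, attending from current-frame detections into a per-camera long-term memory bank, can be summarized with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes the memory bank stores pooled per-box feature vectors from other frames of the same camera, and all class, function, and parameter names (ContextAttention, attn_dim, the additive fusion) are hypothetical choices made for exposition.

```python
# Minimal sketch of attention over a per-camera memory bank (hypothetical
# names; additive fusion is one plausible choice, not necessarily the
# paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttention(nn.Module):
    """Attend from current-frame box features into a long-term memory bank."""

    def __init__(self, feat_dim: int, attn_dim: int = 256):
        super().__init__()
        # Learned projections for queries (current boxes) and for keys and
        # values (memory-bank entries).
        self.query_proj = nn.Linear(feat_dim, attn_dim)
        self.key_proj = nn.Linear(feat_dim, attn_dim)
        self.value_proj = nn.Linear(feat_dim, feat_dim)
        self.scale = attn_dim ** 0.5

    def forward(self, box_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # box_feats: (N, feat_dim) features for N detections in the current frame.
        # memory:    (M, feat_dim) features gathered from other frames of the
        #            same camera, potentially spanning weeks of imagery.
        q = self.query_proj(box_feats)                    # (N, attn_dim)
        k = self.key_proj(memory)                         # (M, attn_dim)
        v = self.value_proj(memory)                       # (M, feat_dim)
        attn = F.softmax(q @ k.t() / self.scale, dim=-1)  # (N, M) attention weights
        context = attn @ v                                # (N, feat_dim) aggregated context
        # Fuse the aggregated context back into each box feature before the
        # second-stage classifier; addition is one simple fusion choice.
        return box_feats + context
```

Because the softmax weights are computed per memory entry rather than per fixed time step, this style of aggregation is indifferent to irregular frame intervals, matching the motion-triggered, roughly one-frame-per-second sampling regime described above.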
Pages: 13072-13082
Page count: 11