Object Guided External Memory Network for Video Object Detection

Cited by: 92

Authors:

Deng, Hanming [1]

Hua, Yang [2]

Song, Tao [1]

Zhang, Zongpu [1]

Xue, Zhengui [1]

Ma, Ruhui [1]

Robertson, Neil [2]

Guan, Haibing [1]

Affiliations:

[1] Shanghai Jiao Tong University, Shanghai, People's Republic of China

[2] Queen's University Belfast, Belfast, Antrim, Northern Ireland

Source:

2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019

DOI:

10.1109/ICCV.2019.00678

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Video object detection is more challenging than image object detection because of deteriorated frame quality. To enhance the feature representation, state-of-the-art methods propagate temporal information into the deteriorated frame by aligning and aggregating entire feature maps from multiple nearby frames. However, restricted by the low storage efficiency of feature maps and their vulnerable content-address allocation, these methods do not fully exploit long-term temporal information. In this work, we propose the first object guided external memory network for online video object detection. Storage efficiency is handled by object guided hard attention that selectively stores valuable features, and long-term information is protected when stored in an addressable external data matrix. A set of read/write operations is designed to accurately propagate/allocate and delete multi-level memory features under object guidance. We evaluate our method on the ImageNet VID dataset and achieve state-of-the-art performance as well as a good speed-accuracy tradeoff. Furthermore, by visualizing the external memory, we show the detailed object-level reasoning process across frames.
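The abstract describes an addressable external memory with selective writes, reads, and deletes. The following is only a toy sketch of that general idea, not the paper's implementation: the class name `ExternalMemory`, the `write_threshold` cutoff, the use of a detection score to gate writes, and the dot-product softmax read are all assumptions made for illustration. The paper's actual operations act on multi-level features under object guidance and are not reproduced here.

```python
import math


class ExternalMemory:
    """Toy addressable memory: each row holds one feature vector (hypothetical design)."""

    def __init__(self, write_threshold=0.5):
        self.rows = []  # stored feature vectors, addressed by list index
        self.write_threshold = write_threshold  # assumed cutoff, not from the paper

    def write(self, feature, score):
        """Hard-attention style write: store the feature only if its score is high enough."""
        if score >= self.write_threshold:
            self.rows.append(list(feature))
            return True
        return False

    def read(self, query):
        """Soft read: softmax over dot-product similarity, then a weighted sum of rows."""
        if not self.rows:
            return list(query)  # nothing stored, so return the query unchanged
        sims = [sum(q * r for q, r in zip(query, row)) for row in self.rows]
        peak = max(sims)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        return [
            sum(w * row[i] for w, row in zip(weights, self.rows)) / total
            for i in range(len(query))
        ]

    def delete(self, index):
        """Free one memory row by its address."""
        del self.rows[index]


memory = ExternalMemory(write_threshold=0.5)
memory.write([1.0, 0.0], score=0.9)  # stored
memory.write([0.0, 1.0], score=0.2)  # rejected: score below the threshold
memory.write([0.0, 1.0], score=0.8)  # stored
read_out = memory.read([10.0, 0.0])  # dominated by the first stored row
```

In this sketch the low-scoring feature never reaches memory, which is the storage-saving effect the abstract attributes to selective storage, and the read returns a mixture weighted toward the stored row most similar to the query.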

Pages: 6677-6686

Page count: 10
