Memory Enhanced Global-Local Aggregation for Video Object Detection

Cited by: 256

Authors:

Chen, Yihong ^{[1,3,4]}

Cao, Yue ^{[3]}

Hu, Han ^{[3]}

Wang, Liwei ^{[1,2]}

Affiliations:

[1] Peking Univ, Ctr Data Sci, Beijing, Peoples R China

[2] Peking Univ, Sch EECS, Key Lab Machine Percept, MOE, Beijing, Peoples R China

[3] Microsoft Res Asia, Beijing, Peoples R China

[4] Zhejiang Lab, Hangzhou, Peoples R China

Source:

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020

Funding:

National Key Research and Development Program of China;

Keywords:

DOI:

10.1109/CVPR42600.2020.01035

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

How do humans recognize an object in a piece of video? Due to the deteriorated quality of a single frame, it may be hard for people to identify an occluded object in that frame using only the information within one image. We argue that there are two important cues for humans to recognize objects in videos: the global semantic information and the local localization information. Recently, plenty of methods have adopted self-attention mechanisms to enhance the features in the key frame with either global semantic information or local localization information. In this paper we introduce the memory enhanced global-local aggregation (MEGA) network, which is among the first trials that take full consideration of both global and local information. Furthermore, empowered by a novel and carefully designed Long Range Memory (LRM) module, our proposed MEGA enables the key frame to access much more content than any previous method. Enhanced by these two sources of information, our method achieves state-of-the-art performance on the ImageNet VID dataset. Code is available at https://github.com/Scalsol/mega.pytorch.
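As a rough illustration of the idea in the abstract, the sketch below shows one generic way a key-frame feature could be enhanced by attention over features drawn from both nearby (local) and distant (global) frames. It is not the authors' implementation, which lives in the repository linked above; the function names `softmax` and `aggregate`, the residual sum, and the toy feature vectors are all assumptions made for illustration, and the Long Range Memory module is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(key_feature, support_features):
    """Enhance one key-frame feature with an attention-weighted sum of
    support features (hypothetical helper, not part of MEGA's code)."""
    dim = len(key_feature)
    # Scaled dot-product similarity between the key feature and each support feature.
    scores = [
        sum(k * s for k, s in zip(key_feature, sup)) / math.sqrt(dim)
        for sup in support_features
    ]
    weights = softmax(scores)
    # Weighted sum of support features, added back to the key feature (residual).
    context = [
        sum(w * sup[i] for w, sup in zip(weights, support_features))
        for i in range(dim)
    ]
    enhanced = [k + c for k, c in zip(key_feature, context)]
    return enhanced, weights


# Toy data (assumed): 2-D features for one key-frame proposal.
key = [1.0, 0.0]
local_feats = [[1.0, 0.0], [0.9, 0.1]]   # from frames adjacent to the key frame
global_feats = [[0.0, 1.0], [0.8, 0.2]]  # from frames sampled across the whole video

# Pool both sources so the key frame attends to local and global context together.
enhanced, weights = aggregate(key, local_feats + global_feats)
```

In this toy setup, support features more similar to the key feature receive larger attention weights, so they contribute more to the enhanced feature regardless of whether they come from the local or the global pool.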


Pages: 10334-10343

Page count: 10
