Global Memory and Local Continuity for Video Object Detection

Cited by: 13
Authors
Han, Liang [1]
Yin, Zhaozheng [1, 2]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Stony Brook, Dept Biomed Informat, Stony Brook, NY 11794 USA
Funding
U.S. National Science Foundation
Keywords
Feature extraction; Object detection; Detectors; Proposals; Target tracking; Signal processing algorithms; Costs; Video object detection; global memory bank; feature aggregation; local continuity; object tracker
DOI
10.1109/TMM.2022.3164253
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
To deal with challenges in video object detection (VOD) such as occlusion and motion blur, many state-of-the-art video object detectors adopt a feature aggregation module that encodes long-range contextual information to support the current frame. These detectors have three main drawbacks: first, frame-wise detection slows down detection; second, frame-wise detection usually ignores the local continuity of objects in a video, resulting in temporally inconsistent detections; third, the feature aggregation module usually encodes temporal features from either a local video clip or a single video, without exploiting features from other videos. In this work, we develop an online VOD algorithm that aims to balance high speed and high accuracy by exploiting global memory and local continuity. In the algorithm, an effective and efficient global memory bank (GMB) is designed to deposit and update object class features, which enables us to exploit support features from other videos to enhance object features in the current video frames. In addition, to further speed up detection, we design an object tracker that performs object detection on non-key frames based on the detection results of the key frame, leveraging the local continuity of the video. Considering the trade-off between detection accuracy and speed, the proposed framework achieves superior performance on the ImageNet VID dataset. Source code will be released to the public via our GitHub website.
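The two ideas in the abstract, a class-wise global memory bank and tracker-based propagation between key frames, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the memory update rule, the aggregation formula, or the key-frame schedule, so the first-in-first-out update, the softmax-weighted averaging, the 0.5 blending factor, the fixed `key_interval`, and all names (`GlobalMemoryBank`, `run_online`, `detect`, `track`) are assumptions made for illustration only.

```python
import math
from collections import defaultdict, deque


class GlobalMemoryBank:
    """Toy class-wise feature store (hypothetical; not the paper's GMB)."""

    def __init__(self, slots_per_class=4):
        # One bounded queue of feature vectors per object class.
        self.slots = defaultdict(lambda: deque(maxlen=slots_per_class))

    def deposit(self, class_id, feature):
        # Assumed update rule: first-in-first-out once the class is full.
        self.slots[class_id].append(list(feature))

    def enhance(self, class_id, feature):
        """Blend a feature with a softmax-weighted mean of stored features."""
        support = self.slots.get(class_id)
        if not support:
            return list(feature)  # nothing stored yet for this class
        # Dot-product similarity between the query and each stored feature.
        scores = [sum(a * b for a, b in zip(feature, s)) for s in support]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        aggregated = [
            sum(w * s[i] for w, s in zip(weights, support)) / total
            for i in range(len(feature))
        ]
        # Assumed blending factor of 0.5 between query and aggregate.
        return [0.5 * f + 0.5 * a for f, a in zip(feature, aggregated)]


def run_online(frames, detect, track, bank, key_interval=5):
    """Run the full detector on key frames and a tracker on the rest.

    `detect(frame, bank)` returns a list of (class_id, box) pairs and may
    read or update the bank; `track(frame, box)` returns the updated box.
    Both callables are placeholders supplied by the caller.
    """
    results, last = [], []
    for t, frame in enumerate(frames):
        if t % key_interval == 0:
            last = detect(frame, bank)
        else:
            last = [(class_id, track(frame, box)) for class_id, box in last]
        results.append(last)
    return results
```

Because the bank is keyed by class rather than by video, features deposited while processing one video remain available when a later video is processed, which is the cross-video support the abstract describes. The fixed key-frame interval is the simplest possible schedule; an adaptive schedule would fit the same loop.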
Pages: 3681-3693
Number of pages: 13