Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Cited by: 24

Authors:

Xu, Chao [1]

Zhang, Jiangning [1]

Wang, Mengmeng [1]

Tian, Guanzhong [2]

Liu, Yong [1]

Affiliations:

[1] Zhejiang Univ, Inst Cyber Syst & Control, Hangzhou 310027, Peoples R China

[2] Zhejiang Univ, Ningbo Res Inst, Ningbo 315000, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2022 / Vol. 32 / No. 11

Keywords:

Feature extraction; Proposals; Object detection; Optical flow; Detectors; Aggregates; Tracking; Video object detection; feature alignment; feature interaction; instance ID constraint; NETWORKS;

DOI:

10.1109/TCSVT.2022.3183646

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronic and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Video object detection (VOD) aims to detect objects in every frame of a video, a challenging task because object appearance deteriorates in some frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they operate only at the frame level or the proposal level and therefore cannot integrate spatial-temporal features sufficiently. To address this, we treat VOD as an interaction process over hierarchical spatial-temporal features and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework that exploits frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than updating only the reference features. The proposal-level feature aggregation then models object relations to augment the reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint that boosts instance-level features by leveraging support object proposal features belonging to the same object. In addition, we propose a Deformable Feature Alignment (DAlign) module, applied before MST, to achieve more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments on the ImageNet VID and UAVDT datasets demonstrate the superiority of our method over state-of-the-art (SOTA) methods. With ResNet-101, our method achieves 83.3% and 62.1% on the two datasets, outperforming the SOTA method MEGA by 0.4% and 2.7%, respectively.
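
The frame-level idea in the abstract, weighting support-frame features by their similarity to the reference, can be illustrated with a small sketch. This is not the authors' MST module: it is a generic similarity-weighted aggregation, assuming the support features are already spatially aligned with the reference, and assuming cosine similarity followed by a softmax over frames. The function names (`cosine`, `softmax`, `aggregate_frame_features`) and the list-of-vectors data layout are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frame_features(reference, supports):
    """Enhance each reference pixel feature with similarity-weighted support features.

    reference: list of P feature vectors (one per pixel), each of length C.
    supports:  list of frames, each a list of P feature vectors assumed to be
               aligned pixel-for-pixel with the reference (an assumption here).
    """
    frames = [reference] + supports  # the reference frame also contributes to itself
    enhanced = []
    for p, ref_vec in enumerate(reference):
        # Features at the same pixel position in every frame.
        candidates = [frame[p] for frame in frames]
        # One weight per frame, based on similarity to the reference feature.
        weights = softmax([cosine(ref_vec, c) for c in candidates])
        enhanced.append([
            sum(w * c[ch] for w, c in zip(weights, candidates))
            for ch in range(len(ref_vec))
        ])
    return enhanced


if __name__ == "__main__":
    ref = [[1.0, 0.0], [0.0, 1.0]]    # two pixels, two channels
    sup = [[[1.0, 0.0], [1.0, 0.0]]]  # one support frame
    print(aggregate_frame_features(ref, sup))
```

In this toy input the support feature at the first pixel matches the reference, so it receives equal weight, while the dissimilar support feature at the second pixel is down-weighted. A real detector would compute such weights on learned feature maps rather than on raw vectors.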

Pages: 7809-7820

Page count: 12
