Joint Anchor-Feature Refinement for Real-Time Accurate Object Detection in Images and Videos

Cited by: 44
Authors
Chen, Xingyu [1 ,2 ]
Yu, Junzhi [1 ,3 ]
Kong, Shihan [1 ,2 ]
Wu, Zhengxing [1 ,2 ]
Wen, Li [4 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Peking Univ, State Key Lab Turbulence & Complex Syst, Dept Mech & Engn Sci, BIC ESAT,Coll Engn, Beijing 100871, Peoples R China
[4] Beihang Univ, Sch Mech Engn & Automat, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Head; Object detection; Videos; Detectors; Real-time systems; Task analysis; neural networks; computer vision; deep learning;
DOI
10.1109/TCSVT.2020.2980876
Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline classification code
0808 ; 0809 ;
Abstract
Object detection has been vigorously investigated for years, but fast and accurate detection in real-world scenes remains highly challenging. To overcome the drawbacks of single-stage detectors, we aim at precisely detecting objects in both static and temporal scenes in real time. First, we design a novel anchor-offset detection as a dual refinement mechanism, comprising an anchor refinement, a feature-location refinement, and a deformable detection head. This detection mode simultaneously performs two-step regression and captures accurate object features. Building on the anchor-offset detection, we develop a dual refinement network (DRNet) for high-performance static detection, in which a multi-deformable head is further designed to leverage contextual information for describing objects. For temporal detection in videos, we develop temporal refinement networks (TRNet) and temporal dual refinement networks (TDRNet) by propagating refinement information across time. We also propose a soft refinement strategy that temporally matches object motion with the previous refinement. The proposed methods are evaluated on the PASCAL VOC, COCO, and ImageNet VID datasets. Extensive comparisons on static and temporal detection verify the superiority of DRNet, TRNet, and TDRNet. Our approaches run at fairly fast speeds while achieving significantly improved detection accuracy, i.e., 84.4% mAP on VOC 2007, 83.6% mAP on VOC 2012, 69.4% mAP on VID 2017, and 42.4% AP on COCO. Finally, with encouraging results, our methods are applied to online underwater object detection and grasping with an autonomous system. Code is publicly available at https://github.com/SeanChenxy/TDRN.
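The "two-step regression" described in the abstract can be illustrated with a minimal sketch of cascaded anchor decoding: a first regression stage refines the hand-crafted anchor, and a second stage regresses relative to the refined anchor. This sketch uses the standard box-delta parameterization common to anchor-based detectors; the function and variable names are illustrative assumptions, not taken from the paper's code.

```python
import math

def decode(anchor, delta):
    """Apply regression deltas (dx, dy, dw, dh) to an anchor (cx, cy, w, h):
    center offsets are scaled by the anchor size, and width/height are
    scaled in log-space, as in the standard box parameterization."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

# Two-step regression: stage 1 refines the initial anchor,
# stage 2 produces the final box relative to the refined anchor.
anchor = (50.0, 50.0, 20.0, 20.0)                         # initial anchor
refined = decode(anchor, (0.5, 0.0, math.log(2), 0.0))    # stage-1 refinement
final = decode(refined, (0.0, 0.25, 0.0, 0.0))            # stage-2 detection
print(refined)  # ≈ (60.0, 50.0, 40.0, 20.0)
print(final)    # ≈ (60.0, 55.0, 40.0, 20.0)
```

Because the second stage starts from a better-localized anchor, its regression targets are smaller and easier to fit, which is the intuition behind the dual refinement design.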
Pages: 594-607
Number of pages: 14