Temporal Context Enhanced Feature Aggregation for Video Object Detection

Cited: 0
Authors
He, Fei [1,2]
Gao, Naiyu [1,2]
Li, Qiaozhe [1,2]
Du, Senyao [3]
Zhao, Xin [1,2]
Huang, Kaiqi [1,2,4]
Affiliations
[1] Chinese Acad Sci, Inst Automat, CRISE, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Horizon Robot, Beijing, Peoples R China
[4] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Source
THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2020 / Vol. 34
Funding
National Natural Science Foundation of China
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video object detection is a challenging task because of appearance deterioration in certain video frames. One typical solution is to aggregate neighboring features to enhance per-frame appearance features. However, such methods ignore the temporal relations between the aggregated frames, which are critical for improving video recognition accuracy. To handle the appearance deterioration problem, this paper proposes a temporal context enhanced network (TCENet) that exploits temporal context information through temporal aggregation for video object detection. To handle object displacement across frames, a novel DeformAlign module is proposed to align spatial features from frame to frame. Instead of adopting a fixed-length window fusion strategy, a temporal stride predictor is proposed to adaptively select video frames for aggregation, which allows variable temporal information to be exploited and requires fewer frames to achieve better results. Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset with a faster runtime. Without bells and whistles, TCENet achieves 80.3% mAP by aggregating only 3 frames.
Pages: 10941-10948
Page count: 8
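
The abstract outlines two mechanisms that a short sketch can make concrete: frame-to-frame feature alignment via a deformable-convolution module (DeformAlign) and weighted temporal aggregation of the aligned features. Below is a minimal PyTorch sketch of that alignment-then-aggregation idea, not the paper's actual implementation: the offset-prediction design, the FGFA-style cosine-similarity fusion weights, and all names (DeformAlignSketch, aggregate) and hyperparameters are assumptions, and the temporal stride predictor is omitted.

# Minimal sketch of deformable alignment + weighted temporal fusion.
# All module names and hyperparameters are hypothetical; the fusion
# scheme is FGFA-style cosine weighting, assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class DeformAlignSketch(nn.Module):
    """Align a support-frame feature map to a reference frame.

    Offsets are predicted from the concatenated reference/support
    features, then used to deformably sample the support features.
    """

    def __init__(self, channels: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Predict one (x, y) offset pair per kernel sampling location.
        self.offset_pred = nn.Conv2d(
            channels * 2, 2 * kernel_size * kernel_size,
            kernel_size=kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(
            channels, channels, kernel_size=kernel_size, padding=pad)

    def forward(self, ref_feat: torch.Tensor,
                sup_feat: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(torch.cat([ref_feat, sup_feat], dim=1))
        return self.deform_conv(sup_feat, offsets)


def aggregate(ref_feat: torch.Tensor,
              aligned_feats: list) -> torch.Tensor:
    """Fuse aligned features with per-pixel cosine-similarity weights."""
    feats = [ref_feat] + aligned_feats
    # Similarity of each frame to the reference: (T, N, H, W).
    weights = torch.stack(
        [F.cosine_similarity(ref_feat, f, dim=1, eps=1e-6) for f in feats])
    weights = torch.softmax(weights, dim=0).unsqueeze(2)  # (T, N, 1, H, W)
    return (torch.stack(feats) * weights).sum(dim=0)      # (N, C, H, W)


if __name__ == "__main__":
    align = DeformAlignSketch(channels=256)
    ref = torch.randn(1, 256, 38, 50)  # reference-frame feature map
    sup = torch.randn(1, 256, 38, 50)  # support-frame feature map
    fused = aggregate(ref, [align(ref, sup)])
    print(fused.shape)  # torch.Size([1, 256, 38, 50])

In this sketch, only one support frame is fused; the paper's result (80.3% mAP with 3 aggregated frames) would correspond to passing several aligned support features to the fusion step, with the stride predictor choosing which frames to sample.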