Joint Spatial and Temporal Feature Enhancement Network for Disturbed Object Detection

Cited by: 0
Authors
Zhang, Fan [1, 2]
Ji, Hongbing [1, 2]
Zhang, Yongquan [1, 2]
Zhu, Zhigang [1, 2]
Affiliations
[1] XIDIAN UNIV, Xian Key Lab Intelligent Spectrum Sensing & Inform, Xian 710071, Peoples R China
[2] XIDIAN UNIV, Shaanxi Union Res Ctr Univ & Enterprise Intelligen, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Object detection; Semantics; Aggregates; Detectors; Proposals; Correlation; Video object detection; local-global context; deformable temporal sampling; temporal attention;
DOI
10.1109/TCSVT.2024.3432900
Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809
Abstract
Video object detection remains a challenging task due to appearance degradation in certain frames. Existing studies usually aggregate temporal information from multiple frames to enhance the object's appearance representation. Although these methods achieve strong detection performance, two shortcomings remain: (1) the spatial context information within each frame, which can provide additional decision support when objects are corrupted, is not fully exploited; (2) in the feature alignment phase, traditional methods tend to employ one-to-one or one-to-global temporal alignment strategies, overlooking the local temporal correlation of objects. To address these issues, we propose a Joint Spatial and Temporal Feature Enhancement Network (JSTFE-Net) for video object detection, which jointly utilizes spatial and temporal information. First, we present a novel local-global context enhancement module to effectively encode intra-frame spatial context information. This module enhances the learning of both local details and global semantic information of objects, thereby facilitating accurate object perception within the spatial domain. Second, we develop a deformable temporal sampling module, which adaptively samples correlated temporal information according to the motion information between frames. In addition, to improve the aggregation of temporally correlated sampled features from multiple frames, we devise an attention-based temporal aggregation block, which dynamically fuses these feature points based on their temporal similarity with the corresponding object feature point. Note that JSTFE-Net can be readily plugged into both image object detectors and state-of-the-art video object detectors. Extensive experiments on the ImageNet VID dataset show that the proposed JSTFE-Net consistently and significantly improves performance, demonstrating its effectiveness in video object detection.
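The similarity-based fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the normalization, so cosine similarity, a softmax over the sampled points, and the `temperature` parameter are all assumptions, and the function names (`cosine`, `softmax`, `aggregate`) are hypothetical. The sketch fuses feature vectors sampled from neighbouring frames into one vector for a single reference feature point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(reference, sampled, temperature=1.0):
    """Fuse sampled temporal features, weighted by similarity to the reference.

    reference:   feature vector of the object point in the current frame
    sampled:     feature vectors sampled from other frames (same dimension)
    temperature: hypothetical scaling factor, not taken from the paper
    """
    weights = softmax([cosine(reference, s) / temperature for s in sampled])
    dim = len(reference)
    fused = [sum(w * s[i] for w, s in zip(weights, sampled)) for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    ref = [1.0, 0.0]
    neighbours = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    fused, weights = aggregate(ref, neighbours)
    print("weights:", weights)
    print("fused:", fused)
```

Under these assumptions, sampled points that resemble the reference point receive larger weights, so a degraded or unrelated sample contributes less to the fused feature.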
Pages: 12258-12273
Page count: 16
Related papers
50 results in total
  • [1] Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection
    Xu, Chao
    Zhang, Jiangning
    Wang, Mengmeng
    Tian, Guanzhong
    Liu, Yong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (11) : 7809 - 7820
  • [2] Spatiotemporal Feature Enhancement Network for Blur Robust Underwater Object Detection
    Zhou, Hao
    Qi, Lu
    Huang, Hai
    Yang, Xu
    Yang, Jing
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2024, 16 (05) : 1814 - 1828
  • [3] SSFENET: SPATIAL AND SEMANTIC FEATURE ENHANCEMENT NETWORK FOR OBJECT DETECTION
    Wang, Tianyuan
    Ma, Can
    Su, Haoshan
    Wang, Weiping
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 1500 - 1504
  • [4] Class-Aware Feature Aggregation Network for Video Object Detection
    Han, Liang
    Wang, Pichao
    Yin, Zhaozheng
    Wang, Fan
    Li, Hao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (12) : 8165 - 8178
  • [5] Feature Split-Merge-Enhancement Network for Remote Sensing Object Detection
    Ma, Wenping
    Li, Na
    Zhu, Hao
    Jiao, Licheng
    Tang, Xu
    Guo, Yuwei
    Hou, Biao
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [6] SPATIAL-TEMPORAL FEATURE AGGREGATION NETWORK FOR VIDEO OBJECT DETECTION
    Chen, Zhu
    Li, Weihai
    Fei, Chi
    Liu, Bin
    Yu, Nenghai
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 1858 - 1862
  • [7] Deep Spatial-Temporal Joint Feature Representation for Video Object Detection
    Zhao, Baojun
    Zhao, Boya
    Tang, Linbo
    Han, Yuqi
    Wang, Wenzheng
    SENSORS, 2018, 18 (03)
  • [8] Infrared Maritime Object Detection Network With Feature Enhancement and Adjacent Fusion
    Zhang, Meng
    Dong, Lili
    Gao, Yulin
    Wang, Yichen
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 5750 - 5760
  • [9] Temporal feature enhancement network with external memory for live-stream video object detection
    Fujitake, Masato
    Sugimoto, Akihiro
    PATTERN RECOGNITION, 2022, 131
  • [10] Latent Feature Pyramid Network for Object Detection
    Xie, Jin
    Pang, Yanwei
    Nie, Jing
    Cao, Jiale
    Han, Jungong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2153 - 2163