GSE: A global-local storage enhanced video object recognition model

Cited by: 0
Authors
Shi, Yuhong [1 ,2 ,3 ,4 ]
Pan, Hongguang [1 ,2 ]
Jiang, Ze [5 ]
Zhang, Libin
Miao, Rui [6 ]
Wang, Zheng [1 ,2 ]
Lei, Xinyu [4 ,5 ]
Affiliations
[1] Xian Univ Sci & Technol, Coll Elect & Control Engn, Xian 710054, Peoples R China
[2] Xian Key Lab Elect Equipment Condit Monitoring & P, Xian 710054, Peoples R China
[3] Natl Key Lab Human Machine Hybrid Augmented Intell, Xian 710054, Peoples R China
[4] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710054, Peoples R China
[5] CCTEG Changzhou Res Inst, Changzhou 213000, Peoples R China
[6] Shenzhen HQVT Technol Co Ltd, Shenzhen 518102, Peoples R China
Keywords
Video object recognition; Multi-frame aggregation; Global-local storage; Cascading multi-head attention; NETWORK;
DOI
10.1016/j.neunet.2024.107109
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Substantial similarity and redundant information in video data limit the performance of video object recognition models. To address this issue, this paper proposes a Global-Local Storage Enhanced video object recognition model (GSE). First, the model incorporates a two-stage dynamic multi-frame aggregation module to aggregate shallow frame features. This module aggregates features in batches from each input video through feature extraction, dynamic multi-frame aggregation, and centralized concatenation, significantly reducing the model's computational burden while retaining key information. In addition, a Global-Local Storage (GS) module is constructed to retain and use the information in the frame sequence effectively. This module classifies features using a temporal difference threshold method and filters and retains them through a process of inheritance, storage, and output. By integrating global, local, and key features, the model can accurately capture important temporal features in complex video scenes. Next, a Cascaded Multi-head Attention (CMA) mechanism is designed. Its multi-head cascade structure progressively focuses on object features and explores the correlations between key features and the global and local features, while a differential step attention calculation keeps the computation efficient. Finally, the model structure is optimized, its parameters are tuned, and the performance of the GSE model is verified through comprehensive experiments. Experimental results on the ImageNet 2015 and NPSDrones datasets show that the GSE model achieves the highest mAP, 0.8352 and 0.8617, respectively. Compared with other models, the GSE model achieves a good balance across metrics such as precision, efficiency, and power consumption.
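The abstract does not spell out how the temporal difference threshold is applied, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes each frame is already reduced to a flat feature vector, uses mean absolute difference as the distance, and marks a frame as "key" when it differs from the most recent key frame by more than a threshold. The function names, the distance measure, and the threshold value are all hypothetical.

```python
from typing import List, Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length, non-empty vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and equal in length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def split_by_temporal_difference(
    features: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[int], List[int]]:
    """Split frame indices into (key, redundant) lists.

    Assumption (not from the paper): a frame is a key frame if its
    feature vector differs from the last key frame by more than
    `threshold`. The first frame is always treated as a key frame.
    """
    key: List[int] = []
    redundant: List[int] = []
    last_key = None
    for i, feat in enumerate(features):
        if last_key is None or mean_abs_diff(feat, last_key) > threshold:
            key.append(i)
            last_key = feat
        else:
            redundant.append(i)
    return key, redundant


# Toy two-dimensional "features" for five consecutive frames.
frames = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [3.0, 3.0]]
key, redundant = split_by_temporal_difference(frames, threshold=0.5)
print(key, redundant)
```

In this toy run, frames 1 and 3 change little relative to the preceding key frame and are treated as redundant, which is the kind of filtering a threshold on temporal differences is meant to provide.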
Pages: 11