Spatial-temporal multiscale feature optimization based two-stream convolutional neural network for action recognition

Cited by: 1
Authors
Xia, Limin [1]
Fu, Weiye [1]
Affiliation
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2024 / Vol. 27 / No. 08
Funding
National Natural Science Foundation of China
Keywords
Action recognition; Two-stream network; Attention mechanism; Multiscale features
DOI
10.1007/s10586-024-04553-w
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Human action recognition is one of the most challenging tasks in computer vision because of background noise interference and video frame redundancy. To address these problems, we propose a two-stream convolutional neural network based on Spatial-Temporal Multiscale Feature Optimization (ST-MFO). Specifically, multiscale features generated by a pyramid pooling network are combined with improved coordinate attention, which yields a richer feature representation and reduces background noise interference. Meanwhile, we introduce density peak clustering based on a nonlinear kernel function, which extracts more representative key frames. To improve classification efficiency, we also assign varying degrees of attention to key frames through temporal attention. In addition, we propose an attention-based spatial-temporal information interaction module that optimizes temporal and spatial features by exploiting the complementarity between temporal and spatial information. Experimental results on four benchmark video datasets show that ST-MFO achieves comparable or better performance than state-of-the-art methods.
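
The key-frame step mentioned in the abstract can be illustrated with a generic density peak clustering sketch. This is not the authors' implementation: the abstract does not specify the kernel, so a Gaussian kernel is assumed here, and the function name `select_key_frames`, the `cutoff_quantile` heuristic, and the use of plain Euclidean distance between per-frame feature vectors are all hypothetical choices made for illustration.

```python
import math

def select_key_frames(features, num_key_frames, cutoff_quantile=0.2):
    """Return sorted indices of frames chosen as density peaks (illustrative sketch)."""
    n = len(features)
    if n <= num_key_frames:
        return list(range(n))
    # Pairwise Euclidean distances between per-frame feature vectors.
    dist = [[math.dist(a, b) for b in features] for a in features]
    # Cutoff distance d_c: a low quantile of all pairwise distances (assumed heuristic).
    off_diag = sorted(dist[i][j] for i in range(n) for j in range(n) if i != j)
    d_c = max(off_diag[int(cutoff_quantile * (len(off_diag) - 1))], 1e-12)
    # Local density rho_i with a Gaussian kernel (assumed nonlinear kernel).
    rho = [
        sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
        for i in range(n)
    ]
    # delta_i: distance to the nearest frame with strictly higher density;
    # the densest frame gets its largest distance to any other frame.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    # Frames that are both dense and far from denser frames act as cluster centers.
    gamma = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(n), key=lambda i: gamma[i], reverse=True)
    return sorted(ranked[:num_key_frames])

# Hypothetical 1-D frame features forming two well-separated groups.
frames = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,),
          (10.0,), (10.1,), (10.2,), (10.3,), (10.4,)]
key_frames = select_key_frames(frames, num_key_frames=2)
```

Ranking by the product of density and separation favors one representative frame per group of similar frames, which is the intuition behind using density peaks to reduce frame redundancy.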
Pages: 11611-11626
Page count: 16