MA-VLAD: a fine-grained local feature aggregation scheme for action recognition

Cited by: 2
Authors
Feng, Na [1]
Tang, Ying [1]
Song, Zikai [1]
Yu, Junqing [1]
Chen, Yi-Ping Phoebe [2]
Yang, Wei [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
[2] La Trobe Univ, Dept Comp Sci & Informat Technol, Bundoora, Vic 3086, Australia
Keywords
VLAD; Local feature aggregation; Attention; Action recognition
DOI
10.1007/s00530-024-01341-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A recent trend in action recognition involves aggregating local features into a more compact representation to eliminate redundancy in video features while retaining essential components for recognition. An exemplary approach is NetVLAD and its variations, which learn cluster centers for local features and represent them as VLAD descriptors. However, these methods process multi-frame features in a generic and straightforward manner, while overlooking the intricate semantic shifts within consecutive frames. More specifically, they fail to acknowledge that a pivotal aspect of events/actions is the local dynamics of semantic entities. In this paper, we propose Multi-head Attention Modularized VLAD (MA-VLAD) for fine-grained semantic-inclination clustering of features, enhancing VLAD descriptors with a strong local focusing capability. Specifically, we utilize a multi-head mechanism to partition the input features along the channel dimension, and integrate it with the attention mechanism to conduct fine-grained clustering. Additionally, to consolidate temporal information for enhanced recognition, we utilize temporal position embeddings to address order-sensitive events/actions. Our MA-VLAD delivers more dependable video representations than some of the most widely used and potent methods. Extensive experiments on UCF101, HMDB51, and SoccerNet-v2 datasets demonstrate that our MA-VLAD achieves state-of-the-art performance, underscoring its effectiveness.
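The aggregation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no equations, so the scaled dot-product assignment, the sinusoidal temporal position embeddings, the normalization steps, and every name here (`multi_head_vlad`, `sinusoidal_positions`, `centers`) are assumptions made for illustration. Cluster centers are random here, whereas a real model would learn them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(T, D):
    # Assumed choice: Transformer-style sinusoidal temporal position embeddings.
    pos = np.arange(T)[:, None]
    i = np.arange(D)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / D)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def multi_head_vlad(features, centers, num_heads):
    """Aggregate local features (T, D) into one VLAD-style descriptor.

    centers has shape (H, K, D // H): K hypothetical cluster centers per head.
    """
    T, D = features.shape
    H = num_heads
    assert D % H == 0, "channel dimension must be divisible by num_heads"
    d = D // H
    assert centers.shape[0] == H and centers.shape[2] == d
    # Partition the channel dimension into H heads: (H, T, d).
    heads = features.reshape(T, H, d).transpose(1, 0, 2)
    # Attention-style soft assignment of each local feature to each center.
    logits = heads @ centers.transpose(0, 2, 1) / np.sqrt(d)   # (H, T, K)
    assign = softmax(logits, axis=-1)
    # Residuals to every center, weighted by assignment and summed over time.
    residuals = heads[:, :, None, :] - centers[:, None, :, :]  # (H, T, K, d)
    vlad = (assign[..., None] * residuals).sum(axis=1)         # (H, K, d)
    # Per-cluster normalization, then global L2 normalization.
    vlad = vlad / (np.linalg.norm(vlad, axis=-1, keepdims=True) + 1e-12)
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

rng = np.random.default_rng(0)
T, D, H, K = 8, 16, 4, 3                      # toy sizes, chosen arbitrarily
frames = rng.standard_normal((T, D)) + sinusoidal_positions(T, D)
centers = rng.standard_normal((H, K, D // H))
desc = multi_head_vlad(frames, centers, H)
print(desc.shape)  # (48,)
```

Each head clusters only its own slice of the channels, so the assignment weights can differ between slices of the same frame feature. That is one plausible reading of "fine-grained" clustering in the abstract; the paper itself should be consulted for the actual formulation.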
Pages: 13