MA-VLAD: a fine-grained local feature aggregation scheme for action recognition

Cited by: 2
Authors
Feng, Na [1]
Tang, Ying [1]
Song, Zikai [1]
Yu, Junqing [1]
Chen, Yi-Ping Phoebe [2]
Yang, Wei [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
[2] La Trobe Univ, Dept Comp Sci & Informat Technol, Bundoora, Vic 3086, Australia
Keywords
VLAD; Local feature aggregation; Attention; Action recognition
DOI
10.1007/s00530-024-01341-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A recent trend in action recognition involves aggregating local features into a more compact representation to eliminate redundancy in video features while retaining essential components for recognition. An exemplary approach is NetVLAD and its variations, which learn cluster centers for local features and represent them as VLAD descriptors. However, these methods process multi-frame features in a generic and straightforward manner, while overlooking the intricate semantic shifts within consecutive frames. More specifically, they fail to acknowledge that a pivotal aspect of events/actions is the local dynamics of semantic entities. In this paper, we propose Multi-head Attention Modularized VLAD (MA-VLAD) for fine-grained semantic-inclination clustering of features, enhancing VLAD descriptors with a strong local focusing capability. Specifically, we utilize a multi-head mechanism to partition the input features along the channel dimension, and integrate it with the attention mechanism to conduct fine-grained clustering. Additionally, to consolidate temporal information for enhanced recognition, we utilize temporal position embeddings to address order-sensitive events/actions. Our MA-VLAD delivers more dependable video representations than some of the most widely used and potent methods. Extensive experiments on UCF101, HMDB51, and SoccerNet-v2 datasets demonstrate that our MA-VLAD achieves state-of-the-art performance, underscoring its effectiveness.
Pages: 13
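
The abstract describes the aggregation only at a high level: features are split along the channel dimension into heads, each head is clustered with an attention mechanism, and temporal position embeddings make the descriptor order-sensitive. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names, the scaled dot-product soft assignment, the sinusoidal position embeddings, the normalization steps, and the random (untrained) cluster centers are all assumptions made for illustration.

```python
import numpy as np


def sinusoidal_positions(num_frames, dim):
    """Sinusoidal position embeddings of shape (num_frames, dim) (assumed choice)."""
    pos = np.arange(num_frames)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    # Even channels use sine, odd channels use cosine.
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_vlad(features, centers, add_positions=True):
    """Hypothetical multi-head VLAD aggregation.

    features: (T, D) local features, e.g. one vector per frame.
    centers:  (H, K, D // H) cluster centers, K per head.
    Returns a unit-norm descriptor of length K * D.
    """
    T, D = features.shape
    H, K, d = centers.shape
    assert D == H * d, "channel dimension must split evenly across heads"

    if add_positions:
        # Inject frame order before aggregation (assumption).
        features = features + sinusoidal_positions(T, D)

    # Partition channels into heads: (T, D) -> (H, T, d).
    heads = features.reshape(T, H, d).transpose(1, 0, 2)

    # Attention-style soft assignment of each feature to the K centers
    # of its head: softmax over scaled dot products, shape (H, T, K).
    scores = heads @ centers.transpose(0, 2, 1) / np.sqrt(d)
    assign = softmax(scores, axis=-1)

    # VLAD residuals: sum_t assign[h, t, k] * (x[h, t] - c[h, k]).
    vlad = np.einsum("htk,htd->hkd", assign, heads)
    vlad = vlad - assign.sum(axis=1)[:, :, None] * centers

    # Per-cluster normalization, then global L2 normalization.
    vlad = vlad / (np.linalg.norm(vlad, axis=-1, keepdims=True) + 1e-12)
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)


rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 64))       # 16 frames, 64 channels
centers = rng.normal(size=(4, 8, 16))    # 4 heads, 8 clusters, 16 channels each
descriptor = multi_head_vlad(frames, centers)
print(descriptor.shape)  # (512,)
```

In this sketch, calling `multi_head_vlad` with `add_positions=False` gives the same descriptor for any reordering of the frames, because the residual sum does not depend on frame order. Adding the position embeddings removes that invariance, which is one plausible reading of how the abstract's temporal position embeddings address order-sensitive actions.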