Coarse-Fine Nested Network for Weakly Supervised Group Activity Recognition

Cited by: 0

Authors:

Ge, Xiaojing [1,2]

Yan, Rui [3]

Shu, Xiangbo [4]

Chen, Keke [4]

Tian, Wei [5]

Xie, Guo-Sen [4]

Affiliations:

[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China

[2] Anhui Jianzhu Univ, Anhui Prov Key Lab Intelligent Bldg & Bldg Energy, Hefei 230009, Peoples R China

[3] Nanjing Univ, Dept Comp Sci & Technol, Nanjing 210023, Peoples R China

[4] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China

[5] South Cent Minzu Univ, Sch Comp Sci, Wuhan 430074, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2025, Vol. 36, No. 04

Funding:

China Postdoctoral Science Foundation; National Natural Science Foundation of China

Keywords:

Feature extraction; Visualization; Spatiotemporal phenomena; Transformers; Detectors; Activity recognition; Data mining; Attention; transformer; video understanding; weakly supervised group activity recognition (WSGAR)

DOI:

10.1109/TNNLS.2024.3401608

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Weakly supervised group activity recognition (WSGAR) aims to identify the overall behavior of multiple persons without any fine-grained supervision (that is, without individual positions or action labels). Traditional methods usually work from person to whole: they detect persons with off-the-shelf detectors, extract person-level features, and integrate these into group-level features for training the classifier. However, these methods are inflexible because they rely heavily on detector quality. To remove the detector, recent works learn several prototype tokens directly from noisy grid features with learnable weights; this treats all local visual information equally and introduces some redundant and ambiguous information. To address this, we propose a novel coarse-fine nested network (CFNN) that coarsely localizes the key visual patches of an activity and then finely learns both the local and the global features. Specifically, we design a nested interactor (NI) to progressively model the spatiotemporal interactions of a learnable global token. Guided by the spatial interaction cues in NI, we localize several key visual patches via a new coarse-grained spatial localizer (CSL). Finally, we encode these localized visual patches with the help of global spatiotemporal dependency via a new fine-grained spatiotemporal selector (FSS). Extensive experiments on the Volleyball and NBA datasets demonstrate the effectiveness of the proposed CFNN compared with existing competitive methods.
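As a rough illustration of the coarse localization idea in the abstract (keeping only the few patches that a global token attends to most strongly), the following is a minimal sketch, not the authors' CSL. It assumes scaled dot-product attention followed by top-k selection, which the abstract does not confirm; the function names, the toy feature vectors, and the choice of k are all hypothetical.

```python
import math
from typing import List, Tuple


def attention_weights(global_token: List[float],
                      patches: List[List[float]]) -> List[float]:
    """Softmax over scaled dot products between one global token and each patch."""
    dim = len(global_token)
    scores = [
        sum(g * p for g, p in zip(global_token, patch)) / math.sqrt(dim)
        for patch in patches
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_patches(global_token: List[float],
                       patches: List[List[float]],
                       k: int) -> Tuple[List[int], List[float]]:
    """Return the indices of the k most-attended patches (assumed selection rule)."""
    weights = attention_weights(global_token, patches)
    ranked = sorted(range(len(patches)), key=lambda i: weights[i], reverse=True)
    # Restore the original spatial order of the selected patches.
    return sorted(ranked[:k]), weights


# Toy example: four 2-D grid-patch features and one global token.
token = [1.0, 0.0]
grid = [[0.1, 0.0], [2.0, 0.0], [0.0, 5.0], [1.5, 0.0]]
key_indices, weights = select_key_patches(token, grid, k=2)
```

In this toy setup the selected patches would then be passed to a finer encoder, while the remaining patches are ignored; how CFNN actually couples the selection with the NI and FSS modules is described in the paper itself.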

Pages: 7103-7115

Page count: 13
